git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* git-p4 crashes on non UTF-8 output from p4
@ 2021-04-08 19:28 Tzadik Vanderhoof
  2021-04-09 15:38 ` Torsten Bögershausen
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-08 19:28 UTC (permalink / raw)
  To: git

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
    value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
decode byte Ox93 in position 42: invalid start byte

I'd like to make a pull request to have it try another encoding (eg
cp1252) and/or use the Unicode replacement character, to prevent the
whole program from crashing on such a minor problem.

This is especially a problem on the "git p4 clone" command with @all,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Sound ok?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-p4 crashes on non UTF-8 output from p4
  2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof
@ 2021-04-09 15:38 ` Torsten Bögershausen
  2021-04-11  7:16   ` Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Torsten Bögershausen @ 2021-04-09 15:38 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: git

On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> I'd like to make a pull request to have it try another encoding (eg
> cp1252) and/or use the Unicode replacement character, to prevent the
> whole program from crashing on such a minor problem.
>
> This is especially a problem on the "git p4 clone" command with @all,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>
> Sound ok?

Welcome to the Git community.
To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
is never a good thing.
All efforts to prevent the crash are a step forward.

As you mention cp1252 (which is more used under Windows), there are probably lots of
system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:

Make the encoding/fallback configurable.
Let people choose if they want a crash (if things are broken),
fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.

In that sense: we look forward to a pull-request.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-p4 crashes on non UTF-8 output from p4
  2021-04-09 15:38 ` Torsten Bögershausen
@ 2021-04-11  7:16   ` Tzadik Vanderhoof
  2021-04-11  9:37     ` Torsten Bögershausen
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-11  7:16 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git

Here is the pull request:

From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
Date: Sat, 10 Apr 2021 22:41:39 -0700
Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
 git-p4 from crashing on non UTF-8 changeset descriptions

---
 Documentation/git-p4.txt | 10 ++++++++++
 git-p4.py                | 11 ++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b..71f3487 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,16 @@ git-p4.pathEncoding::
  to transcode the paths to UTF-8. As an example, Perforce on Windows
  often uses "cp1252" to encode path names.

+git-p4.fallbackEncoding::
+    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
+    first tries to interpret each description as UTF-8. If that fails, this
+    config allows another encoding to be tried.  The default is "cp1252".  You
+    can set it to another encoding, for example, "iso-8859-5". If instead of
+    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
+    characters replaced by the Unicode replacement character. If you specify
+    "none", there is no fallback, and any non UTF-8 character will cause
+    git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
  Specify the system that is used for large (binary) files. Please note
  that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93..18d02b4 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in
('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except:
+                            fallbackEncoding =
gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
+                            if fallbackEncoding == 'none':
+                                raise
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in
decoded_entry:
-- 
2.31.1.windows.1

On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > When git-p4 reads the output from a p4 command, it assumes it will be
> > 100% UTF-8. If even one character in the output of one p4 command is
> > not UTF-8, git-p4 crashes with:
> >
> > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > decode byte Ox93 in position 42: invalid start byte
> >
> > I'd like to make a pull request to have it try another encoding (eg
> > cp1252) and/or use the Unicode replacement character, to prevent the
> > whole program from crashing on such a minor problem.
> >
> > This is especially a problem on the "git p4 clone" command with @all,
> > where git-p4 needs to read thousands of changeset descriptions, one of
> > which may have a stray smart quote, causing the whole clone operation
> > to fail.
> >
> > Sound ok?
>
> Welcome to the Git community.
> To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> is never a good thing.
> All efforts to prevent the crash are a step forward.
>
> As you mention cp1252 (which is more used under Windows), there are probably lots of
> system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:
>
> Make the encoding/fallback configurable.
> Let people choose if they want a crash (if things are broken),
> fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.
>
> In that sense: we look forward to a pull-request.



-- 
Tzadik

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-p4 crashes on non UTF-8 output from p4
  2021-04-11  7:16   ` Tzadik Vanderhoof
@ 2021-04-11  9:37     ` Torsten Bögershausen
  2021-04-11 20:21       ` Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Torsten Bögershausen @ 2021-04-11  9:37 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: git

On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> Here is the pull request:

Thanks for the work. Some comments inline.

>
> From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> Date: Sat, 10 Apr 2021 22:41:39 -0700


The subject should be one short line, highlighting what this is all about,
followed by a blank line and a longer description about the problem and
the solution. The original description was good, see below.

> Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
>  git-p4 from crashing on non UTF-8 changeset descriptions

In that sense I make a first trial here, subject for improvements:


Subject: [PATCH] Add git-p4.fallbackEncoding config variable

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes e.g. with:

File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
    value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
decode byte Ox93 in position 42: invalid start byte

Allow to try another encoding (eg cp1252) and/or use the
Unicode replacement character  to prevent the whole program from crashing
on such a "minor" problem.

This is especially a problem on the "git p4 clone" command with @all,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.

>
> ---
>  Documentation/git-p4.txt | 10 ++++++++++
>  git-p4.py                | 11 ++++++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b..71f3487 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,16 @@ git-p4.pathEncoding::
>   to transcode the paths to UTF-8. As an example, Perforce on Windows
>   often uses "cp1252" to encode path names.
>
> +git-p4.fallbackEncoding::
> +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> +    first tries to interpret each description as UTF-8. If that fails, this
> +    config allows another encoding to be tried.  The default is "cp1252".  You

I know that cp1252 is attractive to be used, especially for Windows installations that
use Latin-based "characters".
But: If we introduce a new config-variable into Git, the default tends to be
"if not set to anything, behave as the old Git".

> +    can set it to another encoding, for example, "iso-8859-5". If instead of
ISO-8859-5 may be more portable on the different i18 implementations
than the lower-case spelling.

> +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> +    characters replaced by the Unicode replacement character. If you specify
> +    "none", there is no fallback, and any non UTF-8 character will cause
> +    git-p4 to immediately fail.

As said, before, many people may expect Git to fail, so that the default should be
none to avoid surprises.
When a "non-UTF-8-clean" repo is handled, they want to know it.

> +
>  git-p4.largeFileSystem::
>   Specify the system that is used for large (binary) files. Please note
>   that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93..18d02b4 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in
> ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except:
> +                            fallbackEncoding =
> gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> +                            if fallbackEncoding == 'none':
> +                                raise

Would it make sense to tell the user about the new config value here?
 raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
Or somewhat in that style ?

> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in
> decoded_entry:


Did I miss the Signed-off-by here?

Please have a look here:
https://git-scm.com/docs/SubmittingPatches

(or look at Documentation/SubmittingPatches in your git source code)

And finally: Thanks for the contribution.
Is there any chance to add test-cases, to make sure that this feature
is well-tested now and in the future ?


> --
> 2.31.1.windows.1
>
> On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote:
> >
> > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > > When git-p4 reads the output from a p4 command, it assumes it will be
> > > 100% UTF-8. If even one character in the output of one p4 command is
> > > not UTF-8, git-p4 crashes with:
> > >
> > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> > >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > > decode byte Ox93 in position 42: invalid start byte
> > >
> > > I'd like to make a pull request to have it try another encoding (eg
> > > cp1252) and/or use the Unicode replacement character, to prevent the
> > > whole program from crashing on such a minor problem.
> > >
> > > This is especially a problem on the "git p4 clone" command with @all,
> > > where git-p4 needs to read thousands of changeset descriptions, one of
> > > which may have a stray smart quote, causing the whole clone operation
> > > to fail.
> > >
> > > Sound ok?
> >
> > Welcome to the Git community.
> > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> > is never a good thing.
> > All efforts to prevent the crash are a step forward.
> >
> > As you mention cp1252 (which is more used under Windows), there are probably lots of
> > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:
> >
> > Make the encoding/fallback configurable.
> > Let people choose if they want a crash (if things are broken),
> > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.
> >
> > In that sense: we look forward to a pull-request.
>
>
>
> --
> Tzadik

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-p4 crashes on non UTF-8 output from p4
  2021-04-11  9:37     ` Torsten Bögershausen
@ 2021-04-11 20:21       ` Tzadik Vanderhoof
  2021-04-12  4:06         ` Torsten Bögershausen
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-11 20:21 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git

Thank you for your excellent and friendly feedback!

I understand everything you said, but I have a question about the unit
test you requested.  The git-p4.py script currently does not have
tests and is not written in a way that would be testable.  (The Python
function I modified calls into the shell and requires a valid Perforce
installation.)

Would you prefer I  a) refactor the code to be testable and then write
tests  or b) skip the unit testing (not sure if there are any further
options)?

(For option a) I would break out the part of the function I modified
into another function and then call my new function for my testing.
(I guess it would be better to break the refactoring and my changes
into 2 separate commits.)

On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> > Here is the pull request:
>
> Thanks for the work. Some comments inline.
>
> >
> > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> > From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> > Date: Sat, 10 Apr 2021 22:41:39 -0700
>
>
> The subject should be one short line, highlighting what this is all about,
> followed by a blank line and a longer description about the problem and
> the solution. The original description was good, see below.
>
> > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
> >  git-p4 from crashing on non UTF-8 changeset descriptions
>
> In that sense I make a first trial here, subject for improvements:
>
>
> Subject: [PATCH] Add git-p4.fallbackEncoding config variable
>
> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes e.g. with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> Allow to try another encoding (eg cp1252) and/or use the
> Unicode replacement character  to prevent the whole program from crashing
> on such a "minor" problem.
>
> This is especially a problem on the "git p4 clone" command with @all,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>
> Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.
>
> >
> > ---
> >  Documentation/git-p4.txt | 10 ++++++++++
> >  git-p4.py                | 11 ++++++++++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> > index f89e68b..71f3487 100644
> > --- a/Documentation/git-p4.txt
> > +++ b/Documentation/git-p4.txt
> > @@ -638,6 +638,16 @@ git-p4.pathEncoding::
> >   to transcode the paths to UTF-8. As an example, Perforce on Windows
> >   often uses "cp1252" to encode path names.
> >
> > +git-p4.fallbackEncoding::
> > +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> > +    first tries to interpret each description as UTF-8. If that fails, this
> > +    config allows another encoding to be tried.  The default is "cp1252".  You
>
> I know that cp1252 is attractive to be used, especially for Windows installations that
> use Latin-based "characters".
> But: If we introduce a new config-variable into Git, the default tends to be
> "if not set to anything, behave as the old Git".
>
> > +    can set it to another encoding, for example, "iso-8859-5". If instead of
> ISO-8859-5 may be more portable on the different i18 implementations
> than the lower-case spelling.
>
> > +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> > +    characters replaced by the Unicode replacement character. If you specify
> > +    "none", there is no fallback, and any non UTF-8 character will cause
> > +    git-p4 to immediately fail.
>
> As said, before, many people may expect Git to fail, so that the default should be
> none to avoid surprises.
> When a "non-UTF-8-clean" repo is handled, they want to know it.
>
> > +
> >  git-p4.largeFileSystem::
> >   Specify the system that is used for large (binary) files. Please note
> >   that large file systems do not support the 'git p4 submit' command.
> > diff --git a/git-p4.py b/git-p4.py
> > index 09c9e93..18d02b4 100755
> > --- a/git-p4.py
> > +++ b/git-p4.py
> > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> > cb=None, skip_info=False,
> >                  for key, value in entry.items():
> >                      key = key.decode()
> >                      if isinstance(value, bytes) and not (key in
> > ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> > -                        value = value.decode()
> > +                        try:
> > +                            value = value.decode()
> > +                        except:
> > +                            fallbackEncoding =
> > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> > +                            if fallbackEncoding == 'none':
> > +                                raise
>
> Would it make sense to tell the user about the new config value here?
>  raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
> Or somewhat in that style ?
>
> > +                            elif fallbackEncoding == 'replace':
> > +                                value = value.decode(errors='replace')
> > +                            else:
> > +                                value = value.decode(encoding=fallbackEncoding)
> >                      decoded_entry[key] = value
> >                  # Parse out data if it's an error response
> >                  if decoded_entry.get('code') == 'error' and 'data' in
> > decoded_entry:
>
>
> Did I miss the Signed-off-by here?
>
> Please have a look here:
> https://git-scm.com/docs/SubmittingPatches
>
> (or look at Documentation/SubmittingPatches in your git source code)
>
> And finally: Thanks for the contribution.
> Is there any chance to add test-cases, to make sure that this feature
> is well-tested now and in the future ?
>
>
> > --
> > 2.31.1.windows.1
> >
> > On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > > > When git-p4 reads the output from a p4 command, it assumes it will be
> > > > 100% UTF-8. If even one character in the output of one p4 command is
> > > > not UTF-8, git-p4 crashes with:
> > > >
> > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> > > >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > > > decode byte Ox93 in position 42: invalid start byte
> > > >
> > > > I'd like to make a pull request to have it try another encoding (eg
> > > > cp1252) and/or use the Unicode replacement character, to prevent the
> > > > whole program from crashing on such a minor problem.
> > > >
> > > > This is especially a problem on the "git p4 clone" command with @all,
> > > > where git-p4 needs to read thousands of changeset descriptions, one of
> > > > which may have a stray smart quote, causing the whole clone operation
> > > > to fail.
> > > >
> > > > Sound ok?
> > >
> > > Welcome to the Git community.
> > > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> > > is never a good thing.
> > > All efforts to prevent the crash are a step forward.
> > >
> > > As you mention cp1252 (which is more used under Windows), there are probably lots of
> > > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:
> > >
> > > Make the encoding/fallback configurable.
> > > Let people choose if they want a crash (if things are broken),
> > > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.
> > >
> > > In that sense: we look forward to a pull-request.
> >
> >
> >
> > --
> > Tzadik



-- 
Tzadik

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-p4 crashes on non UTF-8 output from p4
  2021-04-11 20:21       ` Tzadik Vanderhoof
@ 2021-04-12  4:06         ` Torsten Bögershausen
  2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Torsten Bögershausen @ 2021-04-12  4:06 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: git


Hej Tzadik,

Let's start with a side note:
This mailing list doesn't use top-posting, everything new is at the end,
so I moved everything there.

And removed some of the old stuff.

> On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@web.de> wrote:
> >
> > On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> > > Here is the pull request:
> >
> > Thanks for the work. Some comments inline.
> >
> > >
> > > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> > > From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> > > Date: Sat, 10 Apr 2021 22:41:39 -0700
> >
> >
> > The subject should be one short line, highlighting what this is all about,
> > followed by a blank line and a longer description about the problem and
> > the solution. The original description was good, see below.
> >
> > > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
> > >  git-p4 from crashing on non UTF-8 changeset descriptions
> >
> > In that sense I make a first trial here, subject for improvements:
> >
> >
> > Subject: [PATCH] Add git-p4.fallbackEncoding config variable
> >
> > When git-p4 reads the output from a p4 command, it assumes it will be
> > 100% UTF-8. If even one character in the output of one p4 command is
> > not UTF-8, git-p4 crashes e.g. with:
> >
> > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > decode byte Ox93 in position 42: invalid start byte
> >
> > Allow to try another encoding (eg cp1252) and/or use the
> > Unicode replacement character  to prevent the whole program from crashing
> > on such a "minor" problem.
> >
> > This is especially a problem on the "git p4 clone" command with @all,
> > where git-p4 needs to read thousands of changeset descriptions, one of
> > which may have a stray smart quote, causing the whole clone operation
> > to fail.
> >
> > Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.
> >
> > >
> > > ---
> > >  Documentation/git-p4.txt | 10 ++++++++++
> > >  git-p4.py                | 11 ++++++++++-
> > >  2 files changed, 20 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> > > index f89e68b..71f3487 100644
> > > --- a/Documentation/git-p4.txt
> > > +++ b/Documentation/git-p4.txt
> > > @@ -638,6 +638,16 @@ git-p4.pathEncoding::
> > >   to transcode the paths to UTF-8. As an example, Perforce on Windows
> > >   often uses "cp1252" to encode path names.
> > >
> > > +git-p4.fallbackEncoding::
> > > +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> > > +    first tries to interpret each description as UTF-8. If that fails, this
> > > +    config allows another encoding to be tried.  The default is "cp1252".  You
> >
> > I know that cp1252 is attractive to be used, especially for Windows installations that
> > use Latin-based "characters".
> > But: If we introduce a new config-variable into Git, the default tends to be
> > "if not set to anything, behave as the old Git".
> >
> > > +    can set it to another encoding, for example, "iso-8859-5". If instead of
> > ISO-8859-5 may be more portable on the different i18 implementations
> > than the lower-case spelling.
> >
> > > +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> > > +    characters replaced by the Unicode replacement character. If you specify
> > > +    "none", there is no fallback, and any non UTF-8 character will cause
> > > +    git-p4 to immediately fail.
> >
> > As said, before, many people may expect Git to fail, so that the default should be
> > none to avoid surprises.
> > When a "non-UTF-8-clean" repo is handled, they want to know it.
> >
> > > +
> > >  git-p4.largeFileSystem::
> > >   Specify the system that is used for large (binary) files. Please note
> > >   that large file systems do not support the 'git p4 submit' command.
> > > diff --git a/git-p4.py b/git-p4.py
> > > index 09c9e93..18d02b4 100755
> > > --- a/git-p4.py
> > > +++ b/git-p4.py
> > > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> > > cb=None, skip_info=False,
> > >                  for key, value in entry.items():
> > >                      key = key.decode()
> > >                      if isinstance(value, bytes) and not (key in
> > > ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> > > -                        value = value.decode()
> > > +                        try:
> > > +                            value = value.decode()
> > > +                        except:
> > > +                            fallbackEncoding =
> > > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> > > +                            if fallbackEncoding == 'none':
> > > +                                raise
> >
> > Would it make sense to tell the user about the new config value here?
> >  raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
> > Or somewhat in that style ?
> >
> > > +                            elif fallbackEncoding == 'replace':
> > > +                                value = value.decode(errors='replace')
> > > +                            else:
> > > +                                value = value.decode(encoding=fallbackEncoding)
> > >                      decoded_entry[key] = value
> > >                  # Parse out data if it's an error response
> > >                  if decoded_entry.get('code') == 'error' and 'data' in
> > > decoded_entry:
> >
> >
> > Did I miss the Signed-off-by here?
> >
> > Please have a look here:
> > https://git-scm.com/docs/SubmittingPatches
> >
> > (or look at Documentation/SubmittingPatches in your git source code)
> >
> > And finally: Thanks for the contribution.
> > Is there any chance to add test-cases, to make sure that this feature
> > is well-tested now and in the future ?
> >
[snip]


On Sun, Apr 11, 2021 at 01:21:47PM -0700, Tzadik Vanderhoof wrote:
> Thank you for your excellent and friendly feedback!
>
> I understand everything you said, but I have a question about the unit
> test you requested.  The git-p4.py script currently does not have
> tests and is not written in a way that would be testable.  (The Python
> function I modified calls into the shell and requires a valid Perforce
> installation.)

I am not really sure about that (there are no test cases).
A valid Perforce installation is needed, yes. Otherwise the p4 tests are skipped.
There are a lot of p4 tests under t/t98..p4....sh

git show a8b05162e894b88aeb7d5064dab

Tells me e.g. that both git-p4.py and, in this very commit,
t/t9822-git-p4-path-encoding.sh have been changed.

All the test are t98xx-git-p4-something.sh (fiund under t),
and you new testcase may be named something like

t9835-git-p4-fallbackEncoding.sh

>
> Would you prefer I  a) refactor the code to be testable and then write
> tests  or b) skip the unit testing (not sure if there are any further
> options)?
>
> (For option a) I would break out the part of the function I modified
> into another function and then call my new function for my testing.
> (I guess it would be better to break the refactoring and my changes
> into 2 separate commits.)
>

Probably not. Typically we can construct a test case first,
and see that ot fails (in you local test-running).
After updating git-p4.py with your improvments the new test should pass.
And of course all the other p4 tests.

There is a whole bunch of "CI tests", run on Linux, MacOs, Windows.
One example of a test run is here, and the "regular (linux-clang,...)
is running the p4 tests:

https://github.com/git/git/runs/2309995044?check_suite_focus=true

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-12  4:06         ` Torsten Bögershausen
@ 2021-04-21  8:46           ` Tzadik Vanderhoof
  2021-04-21  8:55             ` Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-21  8:46 UTC (permalink / raw)
  To: git, tboegi, tzadik.vanderhoof

---
 Documentation/git-p4.txt                   | 10 ++++
 git-p4.py                                  | 11 +++-
 t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b..e0131a9 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,16 @@ git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be in a mixture of encodings.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You
+	can specify, for example, "cp1252". If instead of an encoding,
+	you specify "replace", UTF-8 will be used, with invalid UTF-8
+	characters replaced by the Unicode replacement character. If you
+	specify "none" (the default), there is no fallback, and any non
+	UTF-8 character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93..173f78a 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError as ex:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") from ex
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000..56a245e
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,65 @@
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+if test_have_prereq !MINGW,!CYGWIN; then
+	skip_all='This system is not subject to encoding failures in "git p4 clone"'
+	test_done
+fi
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add cp1252 description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
+		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
+	)
+'
+test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
+		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentación" ]
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentaci�n" ]
+	)
+'
+
+test_done
-- 
2.31.1.windows.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
@ 2021-04-21  8:55             ` Tzadik Vanderhoof
  2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-21  8:55 UTC (permalink / raw)
  To: git, Torsten Bögershausen, Tzadik Vanderhoof

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-21  8:55             ` Tzadik Vanderhoof
@ 2021-04-22  5:05               ` Tzadik Vanderhoof
  2021-04-22 15:50                 ` Torsten Bögershausen
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-22  5:05 UTC (permalink / raw)
  To: git, tboegi; +Cc: Tzadik Vanderhoof

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
    value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
decode byte Ox93 in position 42: invalid start byte

This is especially a problem on the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

This pull request adds a new config setting, allowing git-p4 to try
another encoding (for example, "cp1252") and/or use the Unicode replacement
character, to prevent the whole program from crashing on such a minor problem.

See the documentation included in the patch for more details of how
the new config setting works.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   | 10 ++++
 git-p4.py                                  | 11 +++-
 t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b..e0131a9 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,16 @@ git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be in a mixture of encodings.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You
+	can specify, for example, "cp1252". If instead of an encoding,
+	you specify "replace", UTF-8 will be used, with invalid UTF-8
+	characters replaced by the Unicode replacement character. If you
+	specify "none" (the default), there is no fallback, and any non
+	UTF-8 character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93..3364287 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError as ex:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000..56a245e
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,65 @@
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+if test_have_prereq !MINGW,!CYGWIN; then
+	skip_all='This system is not subject to encoding failures in "git p4 clone"'
+	test_done
+fi
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add cp1252 description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
+		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
+	)
+'
+test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
+		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentación" ]
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentaci�n" ]
+	)
+'
+
+test_done
-- 
2.31.1.windows.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
@ 2021-04-22 15:50                 ` Torsten Bögershausen
  2021-04-22 16:17                   ` Eric Sunshine
  0 siblings, 1 reply; 24+ messages in thread
From: Torsten Bögershausen @ 2021-04-22 15:50 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: git

On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote:

Thanks for V3, please see some comments inline.

> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> This is especially a problem on the "git p4 clone ... @all" command,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>

> This pull request adds a new config setting, allowing git-p4 to try
> another encoding (for example, "cp1252") and/or use the Unicode replacement
> character, to prevent the whole program from crashing on such a minor problem.

"This pull request" is somewhat superflous wording.
How about:

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Documentation is good (and needed, and neccessary).
Probably this is then not needed:
> See the documentation included in the patch for more details of how
> the new config setting works.

>
> Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> ---
>  Documentation/git-p4.txt                   | 10 ++++
>  git-p4.py                                  | 11 +++-
>  t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++
>  3 files changed, 85 insertions(+), 1 deletion(-)
>  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b..e0131a9 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,16 @@ git-p4.pathEncoding::
>  	to transcode the paths to UTF-8. As an example, Perforce on Windows
>  	often uses "cp1252" to encode path names.
>
> +git-p4.fallbackEncoding::
> +	Perforce changeset descriptions can be in a mixture of encodings.
> +	Git-p4 first tries to interpret each description as UTF-8. If that
> +	fails, this config allows another encoding to be tried. You
> +	can specify, for example, "cp1252".

That looks OK according to
https://docs.python.org/3/library/codecs.html#standard-encodings

> + If instead of an encoding,
> +	you specify "replace", UTF-8 will be used, with invalid UTF-8
> +	characters replaced by the Unicode replacement character. If you
> +	specify "none" (the default), there is no fallback, and any non
> +	UTF-8 character will cause git-p4 to immediately fail.
> +

May be, that is a matter of taste:

> + If git-p4.fallbackEncoding is "replace" ", UTF-8 will be used, with invalid UTF-8
> +	characters replaced by the Unicode replacement character.
> +	The default is "none": there is no fallback, and any non
> +	UTF-8 character will cause git-p4 to immediately fail.
> +



>  git-p4.largeFileSystem::
>  	Specify the system that is used for large (binary) files. Please note
>  	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93..3364287 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except UnicodeDecodeError as ex:
> +                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
> +                            if fallbackEncoding == 'none':
> +                                raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding")
> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
> diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
> new file mode 100755
> index 0000000..56a245e
> --- /dev/null
> +++ b/t/t9835-git-p4-config-fallback-encoding.sh
> @@ -0,0 +1,65 @@
> +#!/bin/sh
> +
> +test_description='test git-p4.fallbackEncoding config'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./lib-git-p4.sh
> +
> +if test_have_prereq !MINGW,!CYGWIN; then
> +	skip_all='This system is not subject to encoding failures in "git p4 clone"'
> +	test_done
> +fi

Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ?

> +
> +test_expect_success 'start p4d' '
> +	start_p4d
> +'
> +
> +test_expect_success 'add cp1252 description' '
> +	cd "$cli" &&
> +	echo file1 >file1 &&
> +	p4 add file1 &&
> +	p4 submit -d documentación
> +'
> +
> +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
> +	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
> +	test_when_finished cleanup_git &&
> +	(
> +		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
Should this be >actual ?
And please no ' ' between the '>' and the filename.

> +		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
> +	)
> +'
> +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
> +	git config --global git-p4.fallbackEncoding none &&
> +	test_when_finished cleanup_git &&
> +	(
> +		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
Same here
2 >actual

> +		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
> +	)
> +'
> +
> +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
> +	git config --global git-p4.fallbackEncoding cp1252 &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentación" ]

Style nit:
See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]".

desc=$(head -1 log | awk '\''{print $2}'\'') &&	test "$desc" = "documentación"

> +	)
> +'
> +
> +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
> +	git config --global git-p4.fallbackEncoding replace &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentaci�n" ]
> +	)
> +'
> +
> +test_done
> --
> 2.31.1.windows.1
>

Are there any more comments from the p4 experts ?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-22 15:50                 ` Torsten Bögershausen
@ 2021-04-22 16:17                   ` Eric Sunshine
  2021-04-22 22:33                     ` Eric Sunshine
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Sunshine @ 2021-04-22 16:17 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Tzadik Vanderhoof, Git List

On Thu, Apr 22, 2021 at 11:51 AM Torsten Bögershausen <tboegi@web.de> wrote:
> On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote:
> > +if test_have_prereq !MINGW,!CYGWIN; then
> > +     skip_all='This system is not subject to encoding failures in "git p4 clone"'
> > +     test_done
> > +fi
>
> Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ?

The answer to this question is probably worthy of recording as an
in-code comment just above this conditional so that people coming upon
this test script in the future don't have to ask the same question
(which is especially important if the author is no longer reachable).
If an in-code comment is overkill, then it would probably be a good
idea for the commit message to explain the reason.

> > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
> > +     git config --global git-p4.fallbackEncoding cp1252 &&
> > +     test_when_finished cleanup_git &&
> > +     (
> > +             git p4 clone --dest="$git" //depot@all &&
> > +             cd "$git" &&
> > +             git log --oneline >log &&
> > +             desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ]
>
> Style nit:
> See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]".
>
> desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación"

Style also suggests splitting the line after the &&.

We normally want to avoid using bare single-quotes inside the body of
the test since the body itself is a single-quoted string. These
single-quotes make it harder for a reader to reason about what is
going on; especially with the $2 in there, one has to spend extra
cycles wondering if $2 is correctly expanded when the test runs or
when it is first defined. So, an easier-to-understand rewrite might
be:

    desc=$(head -1 log | awk ''{print \$2}'') &&
    test "$desc" = "documentación"

Many existing tests in this project use `cut` for word-plucking, so an
alternative would be:

    desc=$(head -1 log | cut -d" " -f2) &&

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-22 16:17                   ` Eric Sunshine
@ 2021-04-22 22:33                     ` Eric Sunshine
  2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Sunshine @ 2021-04-22 22:33 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Tzadik Vanderhoof, Git List

On Thu, Apr 22, 2021 at 12:17 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> We normally want to avoid using bare single-quotes inside the body of
> the test since the body itself is a single-quoted string. These
> single-quotes make it harder for a reader to reason about what is
> going on; especially with the $2 in there, one has to spend extra
> cycles wondering if $2 is correctly expanded when the test runs or
> when it is first defined. So, an easier-to-understand rewrite might
> be:
>
>     desc=$(head -1 log | awk ''{print \$2}'') &&

Of course, the quotes surrounding the {print...} should be
double-quotes, not pairs of single-quotes:

    desc=$(head -1 log | awk "{print \$2}") &&

(I didn't notice the problem when originally composing the email since
the compose window wasn't using a fixed-width font, and only noticed
it later when re-reading it in a mail reader which does use
fixed-width. Sorry for any potential confusion.)

> Many existing tests in this project use `cut` for word-plucking, so an
> alternative would be:
>
>     desc=$(head -1 log | cut -d" " -f2) &&

At any rate, using `cut` would be a good option since there's plenty
of precedent in existing test scripts.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-22 22:33                     ` Eric Sunshine
@ 2021-04-23  6:36                       ` Tzadik Vanderhoof
  2021-04-23  6:44                         ` Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-23  6:36 UTC (permalink / raw)
  To: Eric Sunshine, Torsten Bögershausen, Git List; +Cc: Tzadik Vanderhoof

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   |  9 +++
 git-p4.py                                  | 11 +++-
 t/t9835-git-p4-config-fallback-encoding.sh | 76 ++++++++++++++++++++++
 3 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@ git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be stored in any encoding.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You can specify,
+	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+	be used, with invalid UTF-8 characters replaced by the Unicode replacement
+	character. The default is "none": there is no fallback, and any non UTF-8
+	character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..ce352c826b
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,76 @@
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+# The Windows build of p4 encodes its command-line arguments according to the
+# active code page (which defaults to "cp1252"). As a result, "p4 submit -d" causes
+# Unicode changeset descriptions to be stored in the Perforce database as cp1252,
+# and a subsequent "git p4 clone" attempting to decode these descriptions as UTF-8
+# will raise a UnicodeDecodeError, necessitating the use of the git-p4.fallbackEncoding config.
+#
+# The Linux build of p4 encodes its command-line arguments as UTF-8, so changeset descriptions
+# are stored as UTF-8, and UnicodeDecodeError is never raised by "git p4 clone".
+
+if test_have_prereq !MINGW,!CYGWIN; then
+	skip_all='This system is not subject to encoding failures in "git p4 clone"'
+	test_done
+fi
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add cp1252 description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+		grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+	)
+'
+test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+		grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentación"
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentaci�n"
+	)
+'
+
+test_done
-- 
2.31.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
@ 2021-04-23  6:44                         ` Tzadik Vanderhoof
  2021-04-23 19:08                           ` Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-23  6:44 UTC (permalink / raw)
  To: Eric Sunshine, Torsten Bögershausen, Git List

(last patch should be labeled as v4... sorry)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-23  6:44                         ` Tzadik Vanderhoof
@ 2021-04-23 19:08                           ` Tzadik Vanderhoof
  2021-04-24  8:14                             ` Torsten Bögershausen
  0 siblings, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-23 19:08 UTC (permalink / raw)
  To: Eric Sunshine, Torsten Bögershausen, Git List

To clarify....

The new config variable I am introducing addresses an issue that only
occurs on Windows.  This is because the behavior of the "p4" command
differs on Windows vs Linux around Unicode in changeset descriptions.

I don't have the source code for "p4", but I'm guessing it's written
in C, and that this difference in behavior is simply a result of the
fact that there is no defined standard of how "char *argv[]" in "main"
should deal with non-ASCII characters being passed in from the command
line.

As a result, "git p4 clone" on Linux is not affected by this "p4"
behavior.  Since my tests assume the Windows behavior, they fail when
run on Linux.  For this reason, I added code to my tests to skip them
on Linux.

On a related note, I don't think there are any CI environments on
github for git that are (a) on Windows, and (b) have Python and (c)
have Perforce, so I don't think my tests are actually running on
github CI.  I'm not sure how that can be addressed.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-23 19:08                           ` Tzadik Vanderhoof
@ 2021-04-24  8:14                             ` Torsten Bögershausen
  2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Torsten Bögershausen @ 2021-04-24  8:14 UTC (permalink / raw)
  To: Tzadik Vanderhoof
  Cc: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff

(Adding some of the p4 and Windows experts in cc)

On Fri, Apr 23, 2021 at 12:08:17PM -0700, Tzadik Vanderhoof wrote:
> To clarify....

Good. This is good information, and the important stuff could go
into the commit message. And because the commit as such should be
self-contained (as much as possible).
Giving an overview about the problem.

>
> The new config variable I am introducing addresses an issue that only
> occurs on Windows.  This is because the behavior of the "p4" command
> differs on Windows vs Linux around Unicode in changeset descriptions.

What does Windows mean in this context ?
Is p4 a "console application" ?
In this case it may be possible to use CHCP to change to a different code page ?

>
> I don't have the source code for "p4", but I'm guessing it's written
> in C, and that this difference in behavior is simply a result of the
> fact that there is no defined standard of how "char *argv[]" in "main"
> should deal with non-ASCII characters being passed in from the command
> line.
>
> As a result, "git p4 clone" on Linux is not affected by this "p4"
> behavior.

Is it ?
What happens if yoy have a p4 depot that was feed from Windows in CP-1252 and is now
accessed from a Linux  machine ?
Doesthe Linux box face the same problems ?

> Since my tests assume the Windows behavior, they fail when
> run on Linux.  For this reason, I added code to my tests to skip them
> on Linux.

That makes sense, but what is the "Windows behavior" more in detail ?
My understanding is that when you press e.g. the key 'Ä' on the keybaord,
it will give a different byte sequence once that 'Ä' is transferred
across the wire (to the p4 server).
This is dependent on what Linux calls a locale, and all major Linux installations
use UTF-8 these days by default.
But that was not always the case, since in old days they used ISO-8851-1 or something
else, usable for your contry/region.

So most Windows "console applications" are not run under UTF-8, but it
may be possible that "CHCP 65000" (or so) works - more testing needed.
>
> On a related note, I don't think there are any CI environments on
> github for git that are (a) on Windows, and (b) have Python and (c)
> have Perforce, so I don't think my tests are actually running on
> github CI.  I'm not sure how that can be addressed.

That are 3 different questions -
(a) Yes, git is compiled under Windows, both gcc and MSVC (correct me if that is wrong)
(b) Yes, we have python on the different CI. Github actions has python, I use it every day.
(c) There are tests run for p4, but it seems if they are only run under Linux.

It would be nice if your test can pass under Linux, why are they failing ?

If I dig here:
<https://github.com/git/git/runs/2420889332?check_suite_focus=true>

We can see that the t98 test are run, and are passing. Just to pick one:
[15:28:22] t9804-git-p4-label.sh .............................. ok     3969

Thanks for working on this.
It would be good to have a v5 version of the patch with some more informations,
like above, and may be :how is the p4 server configured ?
(Unicode or not ?)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-24  8:14                             ` Torsten Bögershausen
@ 2021-04-27  5:39                               ` Tzadik Vanderhoof
  2021-04-27  5:45                                 ` Tzadik Vanderhoof
  2021-04-28  4:39                                 ` Junio C Hamano
  0 siblings, 2 replies; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-27  5:39 UTC (permalink / raw)
  To: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff
  Cc: Tzadik Vanderhoof

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   |  9 +++
 git-p4.py                                  | 11 ++-
 t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++
 3 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@ git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be stored in any encoding.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You can specify,
+	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+	be used, with invalid UTF-8 characters replaced by the Unicode replacement
+	character. The default is "none": there is no fallback, and any non UTF-8
+	character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..383507803e
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add Unicode description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+# Unicode descriptions cause clone to throw in some environments. This test
+# determines if that is the case in our environment. If so we create a file called "clone_fails".
+# We check that file to in subsequent tests to determine what behavior to expect.
+
+clone_fails="$TRASH_DIRECTORY/clone_fails"
+
+test_expect_success 'clone with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git && {
+		git p4 clone --dest="$git" //depot@all 2>error || (
+			cp /dev/null "$clone_fails" &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		)
+	}
+'
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git && {
+		(
+			test -f "$clone_fails" &&
+			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		) ||
+		(
+			! test -f "$clone_fails" &&
+			git p4 clone --dest="$git" //depot@all 2>error
+		)
+	}
+'
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentación"
+	)
+'
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		{
+			(test -f "$clone_fails" &&
+				test "$desc" = "documentaci�n"
+			) ||
+			(! test -f "$clone_fails" &&
+				test "$desc" = "documentación"
+			)
+		}
+	)
+'
+
+test_expect_success 'unset git-p4.fallbackEncoding' '
+	git config --global --unset git-p4.fallbackEncoding
+'
+
+test_done
-- 
2.31.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
@ 2021-04-27  5:45                                 ` Tzadik Vanderhoof
  2021-04-28  4:39                                 ` Junio C Hamano
  1 sibling, 0 replies; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-27  5:45 UTC (permalink / raw)
  To: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff

I modified the test to work on both Linux and Windows.  See the
comments in the test.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
  2021-04-27  5:45                                 ` Tzadik Vanderhoof
@ 2021-04-28  4:39                                 ` Junio C Hamano
  2021-04-28 14:58                                   ` Torsten Bögershausen
  1 sibling, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2021-04-28  4:39 UTC (permalink / raw)
  To: Tzadik Vanderhoof
  Cc: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff

Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

>  t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++
>  3 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

9835 is already taken (see 'seen').

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
  2021-04-28  4:39                                 ` Junio C Hamano
@ 2021-04-28 14:58                                   ` Torsten Bögershausen
  2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
       [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
  0 siblings, 2 replies; 24+ messages in thread
From: Torsten Bögershausen @ 2021-04-28 14:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Tzadik Vanderhoof, Eric Sunshine, Git List, Johannes Schindelin,
	Luke Diamand, Pete Wyckoff

On Wed, Apr 28, 2021 at 01:39:38PM +0900, Junio C Hamano wrote:
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>
> >  t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++
> >  3 files changed, 106 insertions(+), 1 deletion(-)
> >  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh
>
> 9835 is already taken (see 'seen').

In general, this looks good to me.
There are two minor nitpicks to make the patch more the git-way:

> Subject: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptionsw

The head line is somewhat too long.
It should be much shorter, like 50-55 characters, if I recall it rigth.
The first line of the commit message is what we see under PATCH in the email,
followed by a blank line (that's what we have) and a detailed description
(Which we have)

How abut this ?

git-p4: Add git-p4.fallbackEncoding

Add git-p4.fallbackEncoding config variable,
to prevent git-p4 from crashing on non UTF-8 changeset descriptions.

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.


[]

And then, somewhere in the test:

			cp /dev/null "$clone_fails" &&

This should create an empty file, right ?
Then we can use a simple output-redirection:

			>"$clone_fails" &&


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v6] Add git-p4.fallbackEncoding
  2021-04-28 14:58                                   ` Torsten Bögershausen
@ 2021-04-29  7:39                                     ` Tzadik Vanderhoof
  2021-04-29  8:36                                       ` Luke Diamand
       [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
  1 sibling, 1 reply; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-29  7:39 UTC (permalink / raw)
  To: Junio C Hamano, Eric Sunshine, Git List, Johannes Schindelin,
	Luke Diamand, Pete Wyckoff
  Cc: Tzadik Vanderhoof

Add git-p4.fallbackEncoding config variable, to prevent git-p4 from
crashing on non UTF-8 changeset descriptions.

When git-p4 reads the output from a p4 command, it assumes it will
be 100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   |  9 ++
 git-p4.py                                  | 11 ++-
 t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++
 3 files changed, 117 insertions(+), 1 deletion(-)
 create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@ git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be stored in any encoding.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You can specify,
+	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+	be used, with invalid UTF-8 characters replaced by the Unicode replacement
+	character. The default is "none": there is no fallback, and any non UTF-8
+	character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..901bb3759d
--- /dev/null
+++ b/t/t9836-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,98 @@
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add Unicode description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some
+# environments. This test determines if that is the case in our environment. If so,
+# we create a file called "clone_fails". In subsequent tests, we check whether that
+# file exists to determine what behavior to expect.
+
+clone_fails="$TRASH_DIRECTORY/clone_fails"
+
+# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file,
+# and make sure the error message is correct
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git && {
+		git p4 clone --dest="$git" //depot@all 2>error || (
+			>"$clone_fails" &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		)
+	}
+'
+
+# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset,
+# also with the correct error message.  Otherwise the clone should succeed.
+
+test_expect_success 'clone with git-p4.fallbackEncoding unset' '
+	git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git && {
+		(
+			test -f "$clone_fails" &&
+			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		) ||
+		(
+			! test -f "$clone_fails" &&
+			git p4 clone --dest="$git" //depot@all 2>error
+		)
+	}
+'
+
+# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding
+# to "cp1252" should cause clone to succeed and get the right description
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentación"
+	)
+'
+
+# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed.
+# If "clone_fails" exists, the description should contain the Unicode replacement
+# character, otherwise the description should be correct (since we're on a system that
+# doesn't have the Unicode issue)
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		{
+			(test -f "$clone_fails" &&
+				test "$desc" = "documentaci�n"
+			) ||
+			(! test -f "$clone_fails" &&
+				test "$desc" = "documentación"
+			)
+		}
+	)
+'
+
+test_done
-- 
2.31.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6] Add git-p4.fallbackEncoding
  2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
@ 2021-04-29  8:36                                       ` Luke Diamand
  2021-04-29 17:29                                         ` Tzadik Vanderhoof
  0 siblings, 1 reply; 24+ messages in thread
From: Luke Diamand @ 2021-04-29  8:36 UTC (permalink / raw)
  To: Tzadik Vanderhoof, Junio C Hamano, Eric Sunshine, Git List,
	Johannes Schindelin, Pete Wyckoff, Anders Bakken
  Cc: Andrew Oakley



On 29/04/2021 07:39, Tzadik Vanderhoof wrote:
> Add git-p4.fallbackEncoding config variable, to prevent git-p4 from
> crashing on non UTF-8 changeset descriptions.
> 
> When git-p4 reads the output from a p4 command, it assumes it will
> be 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
> 
>      File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>          value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
>          decode byte Ox93 in position 42: invalid start byte
> 
> This is especially a problem for the "git p4 clone ... @all" command,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
> 
> Add a new config setting, allowing git-p4 to try a fallback encoding
> (for example, "cp1252") and/or use the Unicode replacement character,
> to prevent the whole program from crashing on such a minor problem.

I think Andrew Oakley pointed this out earlier - but in the days of 
Python2 this was (I think) never a problem. Python2 just took in the 
binary data, in whatever encoding, and passed it untouched on to git, 
which in turn just stored it.

https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/

It was only whatever was trying to render the bytestream that needed to 
worry about the encoding.

Now we're making the decision in git-p4 when we ingest it - did we 
consider just passing it along untouched?

The problem at hand is that git-p4 is trying to store it internally as a 
`string' which now is unicode-aware, when perhaps it should not be.

It's going to get very confusing if anyone ingests something from 
Perforce having set the encoding to one thing, and it turns out to be a 
different encoding, or worse, multiple encodings for the same repo.

I also worry that if someone has connected to a Unicode-aware Perforce 
server, and then unwittingly set P4CHARSET, then are we going to end up 
silently scrambling everything?

I'm not sure, encodings make my head hurt.

Luke



> 
> Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> ---
>   Documentation/git-p4.txt                   |  9 ++
>   git-p4.py                                  | 11 ++-
>   t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++
>   3 files changed, 117 insertions(+), 1 deletion(-)
>   create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh
> 
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424..86d3ffa644 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,15 @@ git-p4.pathEncoding::
>   	to transcode the paths to UTF-8. As an example, Perforce on Windows
>   	often uses "cp1252" to encode path names.
>   
> +git-p4.fallbackEncoding::
> +	Perforce changeset descriptions can be stored in any encoding.
> +	Git-p4 first tries to interpret each description as UTF-8. If that
> +	fails, this config allows another encoding to be tried. You can specify,
> +	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
> +	be used, with invalid UTF-8 characters replaced by the Unicode replacement
> +	character. The default is "none": there is no fallback, and any non UTF-8
> +	character will cause git-p4 to immediately fail.
> +
>   git-p4.largeFileSystem::
>   	Specify the system that is used for large (binary) files. Please note
>   	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac4..202fb01bdf 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                   for key, value in entry.items():
>                       key = key.decode()
>                       if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except UnicodeDecodeError:
> +                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
> +                            if fallbackEncoding == 'none':
> +                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                       decoded_entry[key] = value
>                   # Parse out data if it's an error response
>                   if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
> diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh
> new file mode 100755
> index 0000000000..901bb3759d
> --- /dev/null
> +++ b/t/t9836-git-p4-config-fallback-encoding.sh
> @@ -0,0 +1,98 @@
> +#!/bin/sh
> +
> +test_description='test git-p4.fallbackEncoding config'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./lib-git-p4.sh
> +
> +test_expect_success 'start p4d' '
> +	start_p4d
> +'
> +
> +test_expect_success 'add Unicode description' '
> +	cd "$cli" &&
> +	echo file1 >file1 &&
> +	p4 add file1 &&
> +	p4 submit -d documentación
> +'
> +
> +# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some
> +# environments. This test determines if that is the case in our environment. If so,
> +# we create a file called "clone_fails". In subsequent tests, we check whether that
> +# file exists to determine what behavior to expect.
> +
> +clone_fails="$TRASH_DIRECTORY/clone_fails"
> +
> +# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file,
> +# and make sure the error message is correct
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
> +	git config --global git-p4.fallbackEncoding none &&
> +	test_when_finished cleanup_git && {
> +		git p4 clone --dest="$git" //depot@all 2>error || (
> +			>"$clone_fails" &&
> +			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
> +		)
> +	}
> +'
> +
> +# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset,
> +# also with the correct error message.  Otherwise the clone should succeed.
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding unset' '
> +	git config --global --unset git-p4.fallbackEncoding &&
> +	test_when_finished cleanup_git && {
> +		(
> +			test -f "$clone_fails" &&
> +			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
> +			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
> +		) ||
> +		(
> +			! test -f "$clone_fails" &&
> +			git p4 clone --dest="$git" //depot@all 2>error
> +		)
> +	}
> +'
> +
> +# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding
> +# to "cp1252" should cause clone to succeed and get the right description
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
> +	git config --global git-p4.fallbackEncoding cp1252 &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | cut -d" " -f2) &&
> +		test "$desc" = "documentación"
> +	)
> +'
> +
> +# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed.
> +# If "clone_fails" exists, the description should contain the Unicode replacement
> +# character, otherwise the description should be correct (since we're on a system that
> +# doesn't have the Unicode issue)
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
> +	git config --global git-p4.fallbackEncoding replace &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | cut -d" " -f2) &&
> +		{
> +			(test -f "$clone_fails" &&
> +				test "$desc" = "documentaci�n"
> +			) ||
> +			(! test -f "$clone_fails" &&
> +				test "$desc" = "documentación"
> +			)
> +		}
> +	)
> +'
> +
> +test_done
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6] Add git-p4.fallbackEncoding
       [not found]                                       ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de>
@ 2021-04-29 17:14                                         ` Tzadik Vanderhoof
  0 siblings, 0 replies; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-29 17:14 UTC (permalink / raw)
  To: Torsten Bögershausen, Git List

On Thu, Apr 29, 2021 at 7:12 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> Hej Tzadik,
>
> This version went only to my email ?
>

v6 went to the list as well as your email.  I just forgot to include
you in the email to the list, so I sent you another copy with just
you.

> The test case number seems to be fixed, thanks.
>
> (Normally we don't have this collision, but right now
> it seem as if there is much going on in the git-p4 area,
> which is good)
>
> The "headline" is still overlong, it seams.
>

I did shorten the first line of my commit as you asked and used that
commit to create the v6 path. That first (short) line goes into the
Subject line of the patch. When you do "git am" it will use the
subject (which is short) as the first line of the commit. The overlong
summary will become the 3rd line of the commit (after a blank second
line)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6] Add git-p4.fallbackEncoding
  2021-04-29  8:36                                       ` Luke Diamand
@ 2021-04-29 17:29                                         ` Tzadik Vanderhoof
  0 siblings, 0 replies; 24+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-29 17:29 UTC (permalink / raw)
  To: Luke Diamand
  Cc: Junio C Hamano, Eric Sunshine, Git List, Johannes Schindelin,
	Pete Wyckoff, Anders Bakken, Andrew Oakley

On Thu, Apr 29, 2021 at 1:36 AM Luke Diamand <luke@diamand.org> wrote:
>
> I think Andrew Oakley pointed this out earlier - but in the days of
> Python2 this was (I think) never a problem. Python2 just took in the
> binary data, in whatever encoding, and passed it untouched on to git,
> which in turn just stored it.
>

Unfortunately, I just became aware yesterday that Andrew Oakley was
also working on this issue (his CC to me 2 weeks ago somehow ended up
in my Spam folder, and I only dug it out of there after finding out
that we both created a test with the same number).

When I first became aware of Andrew's work (yesterday), I thought it
would make mine unnecessary, but upon further investigation, I don't
think Andrew's work will solve this problem.  Please see my reply
yesterday to Andew's thread:
https://lore.kernel.org/git/CAKu1iLXRrsB4mRsDfhBH5aahWzDjpfqLuWP9t47RMB=RdpL1iA@mail.gmail.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-04-29 17:29 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof
2021-04-09 15:38 ` Torsten Bögershausen
2021-04-11  7:16   ` Tzadik Vanderhoof
2021-04-11  9:37     ` Torsten Bögershausen
2021-04-11 20:21       ` Tzadik Vanderhoof
2021-04-12  4:06         ` Torsten Bögershausen
2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
2021-04-21  8:55             ` Tzadik Vanderhoof
2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
2021-04-22 15:50                 ` Torsten Bögershausen
2021-04-22 16:17                   ` Eric Sunshine
2021-04-22 22:33                     ` Eric Sunshine
2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
2021-04-23  6:44                         ` Tzadik Vanderhoof
2021-04-23 19:08                           ` Tzadik Vanderhoof
2021-04-24  8:14                             ` Torsten Bögershausen
2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
2021-04-27  5:45                                 ` Tzadik Vanderhoof
2021-04-28  4:39                                 ` Junio C Hamano
2021-04-28 14:58                                   ` Torsten Bögershausen
2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
2021-04-29  8:36                                       ` Luke Diamand
2021-04-29 17:29                                         ` Tzadik Vanderhoof
     [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
     [not found]                                       ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de>
2021-04-29 17:14                                         ` Tzadik Vanderhoof

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/inbox.comp.version-control.git
	nntp://ie5yzdi7fg72h7s4sdcztq5evakq23rdt33mfyfcddc5u3ndnw24ogqd.onion/inbox.comp.version-control.git
	nntp://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git