git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Torsten Bögershausen" <tboegi@web.de>
To: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: git-p4 crashes on non UTF-8 output from p4
Date: Sun, 11 Apr 2021 11:37:46 +0200	[thread overview]
Message-ID: <20210411093746.ymqofe2uawclwu5i@tb-raspi4> (raw)
In-Reply-To: <CAKu1iLX1AyTCSGxDVgiR1cr4=4ODD-gn8jHAinhp7OhDChAf1A@mail.gmail.com>

On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> Here is the pull request:

Thanks for the work. Some comments inline.

>
> From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> Date: Sat, 10 Apr 2021 22:41:39 -0700


The subject should be one short line, highlighting what this is all about,
followed by a blank line and a longer description about the problem and
the solution. The original description was good, see below.

> Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
>  git-p4 from crashing on non UTF-8 changeset descriptions

In that sense I make a first trial here, subject for improvements:


Subject: [PATCH] Add git-p4.fallbackEncoding config variable

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes e.g. with:

File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
    value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
decode byte Ox93 in position 42: invalid start byte

Allow to try another encoding (eg cp1252) and/or use the
Unicode replacement character  to prevent the whole program from crashing
on such a "minor" problem.

This is especially a problem on the "git p4 clone" command with @all,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.

>
> ---
>  Documentation/git-p4.txt | 10 ++++++++++
>  git-p4.py                | 11 ++++++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b..71f3487 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,16 @@ git-p4.pathEncoding::
>   to transcode the paths to UTF-8. As an example, Perforce on Windows
>   often uses "cp1252" to encode path names.
>
> +git-p4.fallbackEncoding::
> +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> +    first tries to interpret each description as UTF-8. If that fails, this
> +    config allows another encoding to be tried.  The default is "cp1252".  You

I know that cp1252 is attractive to be used, especially for Windows installations that
use Latin-based "characters".
But: If we introduce a new config-variable into Git, the default tends to be
"if not set to anything, behave as the old Git".

> +    can set it to another encoding, for example, "iso-8859-5". If instead of
ISO-8859-5 may be more portable on the different i18 implementations
than the lower-case spelling.

> +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> +    characters replaced by the Unicode replacement character. If you specify
> +    "none", there is no fallback, and any non UTF-8 character will cause
> +    git-p4 to immediately fail.

As said, before, many people may expect Git to fail, so that the default should be
none to avoid surprises.
When a "non-UTF-8-clean" repo is handled, they want to know it.

> +
>  git-p4.largeFileSystem::
>   Specify the system that is used for large (binary) files. Please note
>   that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93..18d02b4 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in
> ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except:
> +                            fallbackEncoding =
> gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> +                            if fallbackEncoding == 'none':
> +                                raise

Would it make sense to tell the user about the new config value here?
 raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
Or somewhat in that style ?

> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in
> decoded_entry:


Did I miss the Signed-off-by here?

Please have a look here:
https://git-scm.com/docs/SubmittingPatches

(or look at Documentation/SubmittingPatches in your git source code)

And finally: Thanks for the contribution.
Is there any chance to add test-cases, to make sure that this feature
is well-tested now and in the future ?


> --
> 2.31.1.windows.1
>
> On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote:
> >
> > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > > When git-p4 reads the output from a p4 command, it assumes it will be
> > > 100% UTF-8. If even one character in the output of one p4 command is
> > > not UTF-8, git-p4 crashes with:
> > >
> > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> > >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > > decode byte Ox93 in position 42: invalid start byte
> > >
> > > I'd like to make a pull request to have it try another encoding (eg
> > > cp1252) and/or use the Unicode replacement character, to prevent the
> > > whole program from crashing on such a minor problem.
> > >
> > > This is especially a problem on the "git p4 clone" command with @all,
> > > where git-p4 needs to read thousands of changeset descriptions, one of
> > > which may have a stray smart quote, causing the whole clone operation
> > > to fail.
> > >
> > > Sound ok?
> >
> > Welcome to the Git community.
> > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> > is never a good thing.
> > All efforts to prevent the crash are a step forward.
> >
> > As you mention cp1252 (which is more used under Windows), there are probably lots of
> > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:
> >
> > Make the encoding/fallback configurable.
> > Let people choose if they want a crash (if things are broken),
> > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.
> >
> > In that sense: we look forward to a pull-request.
>
>
>
> --
> Tzadik

  reply	other threads:[~2021-04-11  9:37 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof
2021-04-09 15:38 ` Torsten Bögershausen
2021-04-11  7:16   ` Tzadik Vanderhoof
2021-04-11  9:37     ` Torsten Bögershausen [this message]
2021-04-11 20:21       ` Tzadik Vanderhoof
2021-04-12  4:06         ` Torsten Bögershausen
2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
2021-04-21  8:55             ` Tzadik Vanderhoof
2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
2021-04-22 15:50                 ` Torsten Bögershausen
2021-04-22 16:17                   ` Eric Sunshine
2021-04-22 22:33                     ` Eric Sunshine
2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
2021-04-23  6:44                         ` Tzadik Vanderhoof
2021-04-23 19:08                           ` Tzadik Vanderhoof
2021-04-24  8:14                             ` Torsten Bögershausen
2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
2021-04-27  5:45                                 ` Tzadik Vanderhoof
2021-04-28  4:39                                 ` Junio C Hamano
2021-04-28 14:58                                   ` Torsten Bögershausen
2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
2021-04-29  8:36                                       ` Luke Diamand
2021-04-29 17:29                                         ` Tzadik Vanderhoof
     [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
     [not found]                                       ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de>
2021-04-29 17:14                                         ` Tzadik Vanderhoof

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210411093746.ymqofe2uawclwu5i@tb-raspi4 \
    --to=tboegi@web.de \
    --cc=git@vger.kernel.org \
    --cc=tzadik.vanderhoof@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).