git@vger.kernel.org mailing list mirror (one of many)
From: Joel Holdsworth <jholdsworth@nvidia.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
Subject: RE: [PATCH 0/6] Transition git-p4.py to support Python 3 only
Date: Fri, 10 Dec 2021 10:37:56 +0000
Message-ID: <BN8PR12MB3361388476E57E097DEBF9F7C8719@BN8PR12MB3361.namprd12.prod.outlook.com>
In-Reply-To: <211210.86r1ale0o0.gmgdl@evledraar.gmail.com>

> The commit messages could just really use some extra hand-holding and
> explanation, and a clear split-out of things related to the version bump v.s.
> things not needed for that, or unrelated refactorings.

Yes, I am getting this message loud and clear. I will resubmit with more detailed commit messages today.

To explain the story here: I started using git-p4 as part of my work-flow, and I expect to need it for several years to come. As I began to use it, I found a series of bugs, mostly related to character encoding. In fixing these, I found that some of the troubles were specific to Python 3 - or rather, Python 2's less strict approach to distinguishing between byte sequences and textual strings allowed the script to proceed under Python 2 even though what it was doing was in fact invalid and was potentially corrupting data.

A common problem that users are encountering is that the script attempts to decode incoming byte-streams as UTF-8 text. On Python 3 this fails with an exception if the data contains invalid UTF-8 sequences. In text files created on Windows, the CP1252 Smart Quote characters 0x93 and 0x94 are seen fairly frequently. These bytes are not valid UTF-8 on their own, so if the script encounters any file or file name containing them, it will fail with an exception.
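
As a minimal illustration of the failure described above (the sample bytes and the helper name try_decode_utf8 are made up for this sketch and are not taken from git-p4.py):

```python
# Illustrative only: bytes as a Windows editor using CP1252 might write them.
data = b"\x93quoted\x94"


def try_decode_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# 0x93 is a stray continuation byte in UTF-8, so strict decoding fails.
utf8_result = try_decode_utf8(data)

# The same bytes decode cleanly as CP1252, giving curly quotation marks.
cp1252_result = data.decode("cp1252")
```

Here utf8_result is None, while cp1252_result holds the text with U+201C and U+201D around it.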

Tzadik Vanderhoof submitted a patch attempting to fix some of these issues in April 2021:
https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/

I have two comments about this patch: 1. It doesn't fix my issue, and 2. even with the proposed fallbackEncoding option, it still leaves git-p4 broken by default.

A fallbackEncoding option may still be necessary, but I found that most of the issues I encountered could be side-stepped by simply avoiding decoding incoming data into UTF-8 in the first place.
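
A rough sketch of what "not decoding" can look like, assuming the data only needs to be passed along rather than interpreted (copy_file_contents and the in-memory streams are hypothetical stand-ins, not code from the patch-set):

```python
import io


def copy_file_contents(src, dst, chunk_size=65536):
    """Copy raw bytes from src to dst without ever decoding them."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for an incoming byte stream and its destination.
payload = b"He said \x93hello\x94\n"
src = io.BytesIO(payload)
dst = io.BytesIO()
copied = copy_file_contents(src, dst)
```

Because the bytes are never turned into a str, the CP1252 quotes pass through unchanged and no exception can be raised by a decoder.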

Keeping a clean separation between encoded and decoded text is much easier to do in Python 3. If Python 2 support must be maintained, this will require careful testing of separate code-paths on both platforms, which I don't regard as reasonable given that Python 2 is thoroughly deprecated. Therefore, this first patch-set focusses primarily on removing Python 2 support.

Furthermore, because I expect to be using git-p4 in my daily work-flow for some time to come, I am interested in investing some effort into improving it. There are bugs, unreliable behaviour, user-hostile behaviour, as well as code that would benefit from clean-up and modernisation. In submitting these patches, I am trying to get a read on to what extent such efforts would be accepted by the Git maintainers. 

Is it preferable that patch-sets have a tight focus on a single topic? I am already dividing up my full patch collection, and I can divide it further if requested. I am happy to do this; I was just worried that it might take longer to get all my patches through review.


> Some of these changes also just seem to be entirely unrelated refactorings,
> e.g. 6/6 where you're changing a multi-line commented regexp into
> something that's a dense one-liner. Does Python 3 not support the
> equivalent of Perl's /x, or is something else going on here?

I will improve the commit message to explain the changes being made here.

The regexp is matching RCS keywords: https://www.perforce.com/manuals/p4guide/Content/P4Guide/filetypes.rcs.html - $File$, $Author$, etc., a very simple match. We could keep it multi-line, though this seems overkill to me.

The main significance of this change is that previously git-p4 would compile one of these two regexes for every single file processed. This patch just pre-compiles the two regexes (now matching bytes rather than decoded text) once at the start of the script.
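
For illustration, a pre-compiled bytes regex can still be written multi-line. The sketch below is not the regex from the patch: RCS_KEYWORD_RE and collapse_keywords are hypothetical names, and the keyword list is assumed from the Perforce page linked above.

```python
import re

# Hypothetical pattern, compiled once at import time rather than per file.
RCS_KEYWORD_RE = re.compile(
    rb"""
    \$                      # opening dollar sign
    (Id|Header|Author|Date|DateTime|Change|File|Revision)  # keyword name
    (?::[^$\n]*)?           # optional colon followed by the expanded value
    \$                      # closing dollar sign
    """,
    re.VERBOSE,
)


def collapse_keywords(data):
    """Replace expanded keywords such as b'$File: //depot/a.c $' with b'$File$'."""
    return RCS_KEYWORD_RE.sub(rb"$\1$", data)


line = b"// $File: //depot/main/a.c $ by $Author: joel $\n"
collapsed = collapse_keywords(line)
```

re.VERBOSE is Python's counterpart to Perl's /x, and it works with bytes patterns too, so the choice between one line and several is purely a matter of readability.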
 

  reply	other threads:[~2021-12-10 10:38 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-09 20:10 [PATCH 0/6] Transition git-p4.py to support Python 3 only Joel Holdsworth
2021-12-09 20:10 ` [PATCH 1/6] git-p4: Always pass cmd arguments to subprocess as a python lists Joel Holdsworth
2021-12-09 22:42   ` Junio C Hamano
2021-12-09 20:10 ` [PATCH 2/6] git-p4: Don't print shell commands as " Joel Holdsworth
2021-12-09 20:10 ` [PATCH 3/6] git-p4: Removed support for Python 2 Joel Holdsworth
2021-12-09 22:44   ` Junio C Hamano
2021-12-09 23:07     ` rsbecker
2021-12-10  3:25   ` David Aguilar
2021-12-10 10:44     ` Joel Holdsworth
2021-12-09 20:10 ` [PATCH 4/6] git-p4: Decode byte strings before printing Joel Holdsworth
2021-12-09 22:47   ` Junio C Hamano
2021-12-10  8:40     ` Fabian Stelzer
2021-12-10 10:48       ` Joel Holdsworth
2021-12-10 10:41     ` Joel Holdsworth
2021-12-09 20:10 ` [PATCH 5/6] git-p4: Eliminate decode_stream and encode_stream Joel Holdsworth
2021-12-09 20:10 ` [PATCH 6/6] git-p4: Resolve RCS keywords in binary Joel Holdsworth
2021-12-10  7:57   ` Luke Diamand
2021-12-10 10:51     ` Joel Holdsworth
2021-12-10  0:48 ` [PATCH 0/6] Transition git-p4.py to support Python 3 only Ævar Arnfjörð Bjarmason
2021-12-10 10:37   ` Joel Holdsworth [this message]
2021-12-10 11:30     ` Ævar Arnfjörð Bjarmason
2021-12-10 21:34   ` Junio C Hamano
2021-12-10 21:53     ` rsbecker
2021-12-11 21:00     ` Elijah Newren
2021-12-12  8:55       ` Luke Diamand
2021-12-10  7:53 ` Luke Diamand
2021-12-10 10:54   ` Joel Holdsworth
2021-12-11  9:58     ` Luke Diamand
2021-12-13 13:47       ` Joel Holdsworth
2021-12-13 19:29         ` Junio C Hamano
2021-12-13 19:58           ` Joel Holdsworth
