From: Joel Holdsworth <jholdsworth@nvidia.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
Subject: RE: [PATCH 0/6] Transition git-p4.py to support Python 3 only
Date: Fri, 10 Dec 2021 10:37:56 +0000 [thread overview]
Message-ID: <BN8PR12MB3361388476E57E097DEBF9F7C8719@BN8PR12MB3361.namprd12.prod.outlook.com> (raw)
In-Reply-To: <211210.86r1ale0o0.gmgdl@evledraar.gmail.com>
> The commit messages could just really use some extra hand-holding and
> explanation, and a clear split-out of things related to the version bump v.s.
> things not needed for that, or unrelated refactorings.
Yes, I am getting this message loud and clear. I will resubmit with more detailed commit messages today.
To explain the story here: I started using git-p4 as part of my work-flow, and I expect to need it for several years to come. As I began to use it, I found that a series of bugs - mostly related to character encoding. In fixing these, I found that some of the troubles were specific to Python 3 - or rather Python 2's less strict approach to distinguishing between byte sequences and textual strings allowed the script to proceed Python 2 even though what it was doing was in fact invalid, and was potentially corrupting data.
A common problem that users are encountering is that the script attempts to decode incoming textual byte-streams into UTF-8 strings. On Python 3 this fails with an exception if the data contains invalid UTF-8 codes. For text files created in Windows, CP1252 Smart Quote characters: 0x93 and 0x94 are seen fairly frequently. These codes are invalid in UTF-8, so if the script encounters any file or file name containing them, it will fail with an exception.
Tzadik Vanderhoof submitted a patch attempting to fix some of these issues in April 2021:
https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
My two comments about this patch are that 1. It doesn't fix my issue, and 2. Even with the proposed fallbackEncoding option it still leaves git-p4 broken by default.
A fallbackEncoding option may still be necessary, but I found that most of the issues I encountered could be side-stepped by simply avoiding decoding incoming data into UTF-8 in the first place.
Keeping a clean separation between encoded and decoded text is much easier to do in Python 3. If Python 2 support must be maintained, this will require careful testing of separate code-paths both platforms which I don't regard as reasonable given that Python 2 is thoroughly deprecated. Therefore, this first patch-set focusses primarily on removing Python 2 support.
Furthermore, because I expect to be using git-p4 in my daily work-flow for some time to come, I am interested in investing some effort into improving it. There are bugs, unreliable behaviour, user-hostile behaviour, as well as code that would benefit from clean-up and modernisation. In submitting these patches, I am trying to get a read on to what extent such efforts would be accepted by the Git maintainers.
Is it preferable that patch-sets have a tight focus on a single topic? I am already dividing up my full patch collection. I can divide it further if requested. I am happy to do this, I was just worried that it just might make longer to get all my patches through review.
> Some of these changes also just seem to be entirely unrelated refactorings,
> e.g. 6/6 where you're changing a multi-line commented regexp into
> something that's a dense one-liner. Does Python 3 not support the
> equivalent of Perl's /x, or is something else going on here?
I will improve the commit message to explain the changes being made here.
The regexp is matching RCS keywords: https://www.perforce.com/manuals/p4guide/Content/P4Guide/filetypes.rcs.html - $File$, $Author$, $Author$ etc., a very simple match. We could keep it multi-line, though this seems overkill to me.
The main significance of this change that previously git-p4 would compile one of these two regexes for every single file processed. This patch just pre-compiles the two regexes (now binary regexes, not utf-8 regexes) at the start of the script.
next prev parent reply other threads:[~2021-12-10 10:38 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-09 20:10 [PATCH 0/6] Transition git-p4.py to support Python 3 only Joel Holdsworth
2021-12-09 20:10 ` [PATCH 1/6] git-p4: Always pass cmd arguments to subprocess as a python lists Joel Holdsworth
2021-12-09 22:42 ` Junio C Hamano
2021-12-09 20:10 ` [PATCH 2/6] git-p4: Don't print shell commands as " Joel Holdsworth
2021-12-09 20:10 ` [PATCH 3/6] git-p4: Removed support for Python 2 Joel Holdsworth
2021-12-09 22:44 ` Junio C Hamano
2021-12-09 23:07 ` rsbecker
2021-12-10 3:25 ` David Aguilar
2021-12-10 10:44 ` Joel Holdsworth
2021-12-09 20:10 ` [PATCH 4/6] git-p4: Decode byte strings before printing Joel Holdsworth
2021-12-09 22:47 ` Junio C Hamano
2021-12-10 8:40 ` Fabian Stelzer
2021-12-10 10:48 ` Joel Holdsworth
2021-12-10 10:41 ` Joel Holdsworth
2021-12-09 20:10 ` [PATCH 5/6] git-p4: Eliminate decode_stream and encode_stream Joel Holdsworth
2021-12-09 20:10 ` [PATCH 6/6] git-p4: Resolve RCS keywords in binary Joel Holdsworth
2021-12-10 7:57 ` Luke Diamand
2021-12-10 10:51 ` Joel Holdsworth
2021-12-10 0:48 ` [PATCH 0/6] Transition git-p4.py to support Python 3 only Ævar Arnfjörð Bjarmason
2021-12-10 10:37 ` Joel Holdsworth [this message]
2021-12-10 11:30 ` Ævar Arnfjörð Bjarmason
2021-12-10 21:34 ` Junio C Hamano
2021-12-10 21:53 ` rsbecker
2021-12-11 21:00 ` Elijah Newren
2021-12-12 8:55 ` Luke Diamand
2021-12-10 7:53 ` Luke Diamand
2021-12-10 10:54 ` Joel Holdsworth
2021-12-11 9:58 ` Luke Diamand
2021-12-13 13:47 ` Joel Holdsworth
2021-12-13 19:29 ` Junio C Hamano
2021-12-13 19:58 ` Joel Holdsworth
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=BN8PR12MB3361388476E57E097DEBF9F7C8719@BN8PR12MB3361.namprd12.prod.outlook.com \
--to=jholdsworth@nvidia.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=tzadik.vanderhoof@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).