git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Tao Klerks <tao@klerks.biz>
To: Joel Holdsworth <jholdsworth@nvidia.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Luke Diamand <luke@diamand.org>,
	Junio C Hamano <gitster@pobox.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>,
	Dorgon Chang <dorgonman@hotmail.com>,
	Joachim Kuebart <joachim.kuebart@gmail.com>,
	Daniel Levin <dendy.ua@gmail.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Ben Keene <seraphire@gmail.com>,
	Andrew Oakley <andrew@adoakley.name>
Subject: Re: [PATCH v4 00/22] git-p4: Various code tidy-ups
Date: Tue, 5 Apr 2022 19:16:23 +0200	[thread overview]
Message-ID: <CAPMMpojFrDL9v=fWfyBx-Ko7fdZkd0yroC058n0+KAvL8SPiYA@mail.gmail.com> (raw)
In-Reply-To: <BL0PR12MB4849BAE614E63BBB0B77EC29C8E49@BL0PR12MB4849.namprd12.prod.outlook.com>

On Tue, Apr 5, 2022 at 2:04 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>
> > So, an initial test suggests that a recent version of git-p4 at least doesn't fail in
> > the same way under python3, in the face of at least some of these encoding
> > issues. I don't know yet whether failures will occur in other places, nor
> > whether the not-failing behavior is better, worse or the same as I had under
> > python2, but it seems plausible that I won't be filing any test_expect_failure
> > tests after all, and will instead say "yay, python3 ftw!"
>
> That would be fabulous.
>

Indeed, but completely untrue in the end. I had simply lost my
"python3" override in the update (and failed to notice). It works just
like before, unfortunately, so I'm back on track to try to reproduce
under controlled conditions, and propose fixes.

> I myself have a repository that has a variety of such issues. A common case is CP-1252 Smart Quote characters produced on Windows which are incompatible with UTF-8, without explicit conversion.
>

I was perplexed as to how you could still have these issues, if they
had magically cleared up for me. Now I know - they had not.

> However, a lot of these problems can be avoided by simply avoiding conversion to text in the first place. In many cases the incoming data doesn't need to be converted and can be passed around as binary.

That's certainly the behavior in git-p4 under python2, but I'm pretty
sure it's not "right". Git has a formal and coherent strategy for
encoding - commit (meta)data is interpreted as utf-8 unless an
"encoding" header is present in the commit. Git's output is utf-8
(with on-the-fly re-encoding if commits do have a different encoding
specified), unless you explicitly ask it to output in a different
encoding. These processes are "forgiving", in that a set of bytes that
cannot be interpreted as utf-8, in a commit that does not explicitly
specify an encoding, will be output as-is rather than being rejected
or warned about... But there is a "right way" to store data. When
CP-1252 (high-range) bytes are simply "passed through" into the git
data by python2, the resulting git commits cannot be displayed
correctly by most or any git clients - they expect utf-8.

On the other hand, in Perforce that does not appear to be the case at
all: as far as I can tell, p4v in windows is simply using windows's
active "legacy codepage", and reading or writing bytes accordingly
(CP1252 in my case); in windows cmd, same thing by default; in linux,
where utf8 is the norm nowadays, I get "unprintable character" output
when something was submitted from a CP-1252 environment (same as I do
for the python2-imported git commits, on any platform). If I switch a
windows cmd terminal to utf8 ("chcp 65001"), it behaves the same. If I
create a pending changelist with non-ANSI characters in Linux they get
submitted as utf8-encoded, and when I view the same pending changelist
in windows (from the command-line in its default legacy codepage, or
in p4v which also seems to use the legacy codepage), I see the classic
utf8-interpreted-as-CP1252 artifact "Æ" instead of "Æ".

I don't see any sort of "codepage selection" configuration either in
the perforce clients, or in the connection / server relationship.
Perforce tooling seems to simply assume that the Operating Systems of
all users will be configured with the same codepage - the bytes are
the bytes, clients encode and decode them however they will, and there
is simply no encoding metadata. US/western Windows clients, modern
linux clients, older mac clients, and cyrillic-configured clients, etc
will all have produced mutually-unintelligible content in the same
repo over the years.

In *my* environment, a majority of users have been using Windows over
the years to write commit messages, and most will have had CP-1252
configured, so I believe the most reasonable behavior would be "try
interpreting the bytes as utf-8, and if that fails then fall back to
interpreting as CP-1252, and either way, store the result in git as
utf-8" - but I don't know whether that is the right strategy for a
majority of environments. This strategy makes sense for me because
it's very rare for something that *can* be correctly read as utf-8 to,
in fact, not be intended to be utf-8; that was kind of the point in
the standard's development - so starting by *trying* to interpret as
utf-8 is a safe bet, and has value if *any* of your clients are likely
to have used it. In my environment there are linux users, and in fact
p4 user management happens in linux, so user names/descriptions are
also utf-8.

For other environments I suspect the "fallback codepage" should be
configurable (I don't know how codepage identifiers work
cross-platform in python, that sounds like it might be fun). You could
imagine having an "auto-detect" fallback option, but for the reasons
identified above this would need to operate on a per-string basis to
have high value, and I think CL descriptions (and user names) are
simply too short for autodetection to be a viable option.

Another reasonable strategy (less optimal for my environment) would be
to follow the "python2" approach of writing the bytes into git as they
were read from p4, but also additionally *telling* git what the
encoding is expected to be, by setting the "encoding" header
accordingly in the fast-import data. In my case that wouldn't work
well because, again, author names are utf-8, but commit messages are
more commonly CP-1252. Git expects all the strings in the commit to
use the same encoding (unless I've missed something).


> I'm slowly working toward this goal, and once this patch-set it merged I have a couple of other decoding patches in the pipeline. If you have any other failure cases, please do submit them as test cases, or bug reports at least.
>

I'd love to know what approach you're looking at!

I only use git-p4 for mirroring p4 activity into git, and not the
other way around, so my perspective on this whole topic might be a bit
skewed or short-sighted. For example, if we do end up storing the data
in git as utf-8, as git expects, then when someone comes to create and
submit p4 changelists from git activity, they might presumably want
that stuff to be output-encoded as CP-1252, if that is the more common
convention on their Perforce server...?

> I would prefer the script to discard Python 2 support, but even if the consensus is to retain it, Python 3 forces us to address these issues which is a great step in the right direction.
>

Right - discontinuing python 2 forces the elaboration of *some* sort
of strategy, besides the python2 default of "oh, just import
effectively-corrupt bytestreams into the git repo". That sounds like a
good thing :)

  reply	other threads:[~2022-04-05 22:31 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-10 16:46 [PATCH v4 00/22] git-p4: Various code tidy-ups Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 01/22] git-p4: add blank lines between functions and class definitions Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 02/22] git-p4: remove unneeded semicolons from statements Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 03/22] git-p4: indent with 4-spaces Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 04/22] git-p4: improve consistency of docstring formatting Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 05/22] git-p4: convert descriptive class and function comments into docstrings Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 06/22] git-p4: remove commented code Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 07/22] git-p4: sort and de-duplcate pylint disable list Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 08/22] git-p4: remove padding from lists, tuples and function arguments Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 09/22] git-p4: remove spaces around default arguments Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 10/22] git-p4: removed brackets when assigning multiple return values Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 11/22] git-p4: place a single space after every comma Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 12/22] git-p4: remove extraneous spaces before function arguments Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 13/22] git-p4: remove redundant backslash-continuations inside brackets Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 14/22] git-p4: remove spaces between dictionary keys and colons Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 15/22] git-p4: ensure every comment has a single # Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 16/22] git-p4: ensure there is a single space around all operators Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 17/22] git-p4: normalize indentation of lines in conditionals Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 18/22] git-p4: compare to singletons with "is" and "is not" Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 19/22] git-p4: only seperate code blocks by a single empty line Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 20/22] git-p4: move inline comments to line above Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 21/22] git-p4: seperate multiple statements onto seperate lines Joel Holdsworth
2022-02-10 16:46 ` [PATCH v4 22/22] git-p4: sort imports Joel Holdsworth
2022-04-02 12:14 ` [PATCH v4 00/22] git-p4: Various code tidy-ups Tao Klerks
2022-04-02 17:45   ` Tao Klerks
2022-04-04 14:46   ` Joel Holdsworth
2022-04-05  4:27     ` Tao Klerks
2022-04-05 11:29       ` Tao Klerks
2022-04-05 12:04         ` Joel Holdsworth
2022-04-05 17:16           ` Tao Klerks [this message]
2022-04-11  9:54             ` Tao Klerks

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPMMpojFrDL9v=fWfyBx-Ko7fdZkd0yroC058n0+KAvL8SPiYA@mail.gmail.com' \
    --to=tao@klerks.biz \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=andrew@adoakley.name \
    --cc=dendy.ua@gmail.com \
    --cc=dorgonman@hotmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jholdsworth@nvidia.com \
    --cc=joachim.kuebart@gmail.com \
    --cc=luke@diamand.org \
    --cc=seraphire@gmail.com \
    --cc=sunshine@sunshineco.com \
    --cc=tzadik.vanderhoof@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).