git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Tao Klerks <tao@klerks.biz>
Cc: Tao Klerks via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings
Date: Wed, 13 Apr 2022 20:52:34 +0200	[thread overview]
Message-ID: <220413.86sfqgerf7.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <CAPMMpoj3xZfKnH456AbiHatbBx98yXuj=yWBA8tdHhHdqn_H3Q@mail.gmail.com>


On Wed, Apr 13 2022, Tao Klerks wrote:

> On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>>
>> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
>> > metadata (and does not set an "encoding" header on the commits); this
>> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
>> > commit metadata in git.
>>
>> If it doesn't have an "encoding" header isn't any sequence of bytes OK
>> with git, so how does it create invalid metadata in git?
>
> Just because git allows you to shove any sequence of bytes into a
> commit header, doesn't mean the resulting text is "valid" metadata
> text for all or most purposes. The correct way to encode text in git
> commit metadata is utf-8 (OR tell any readers of this data that it's
> something other than utf-8 via the encoding header) - it's just that
> git itself, the official client, is tolerant of bad byte sequences.
> Other clients are less tolerant. "Sublime Merge", for example, will
> fail to display the commit text at all in some contexts if the bytes
> are not valid utf-8 (or noted as being something else).
>
>>
>> Do you mean that something on the Python side gets confused and doesn't
>> correctly encode it in that case, or that it's e.g. valid UTF-8 but
>> we're lacking the metadata?
>
> In git-p4 under python2, the bytes are simply copied from the perforce
> commit metadata into the git commit metadata verbatim; if those bytes
> happen to be valid utf-8, then they will be interpreted as such in git
> and everything is great. If that is *not* the case, eg the bytes are
> actually windows cp1252 (with bytes/characters in the x8a+ range),
> then "git log" for example will output the raw bytes, and anything
> looking at those bytes (a terminal, or a process that called git) will
> get those unexpected bytes, and need to deal accordingly. A terminal
> will probably display "unprintable character" glyphs, python3 will
> blow up by default, python 2 will be perfectly happy by default, etc.
>
> I summarize this "non-utf-8 bytes in a git commit message without a
> qualifying 'encoding' header" situation as "invalidly-encoded commit
> metadata in git", due to the impact on downstream consumers of git
> metadata. Is there a better characterization?

I must admit that all this time I'd been missing the "Lack of this
header implies that the commit log message is encoded in UTF-8." part of
the docs added in 5dc7bcc2453 (Documentation: i18n commit log message
notes., 2006-12-30). I.e. I thought that "encoding"-header-less just
meant/implied "whatever".

But then again there wouldn't be much point in the encoding header if
its value is "utf8", so surely we'd want to use the lack of a header to
disambiguate utf8 v.s. non-utf8.

AFAICT we only allow selecting between encodings, not "no idea what this
is, but here's some raw sequence of bytes", except by omitting the
header.

It seems to me that between legacy/strict/fallback there's a 4th setting
missing here. I.e. a "try-encoding". One where if your data is valid
utf8 (or cp1252 if we want to get fancy and combine it with "fallback")
you get an encoding header, but having tried that we'll just write
whatever raw data we found, but *not* munge it.

I haven't worked with p4, but having done some legacy SCM imports it was
nice to be able to map data 1=1 to git in those "odd encoding" cases, or
even cases where there was raw binary in a commit message or whatever...

Anyway, all of this is fine with me as-is, I just had a drive-by
question after some admittedly not very careful reading, sorry.

  reply	other threads:[~2022-04-13 19:05 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-11  9:42 [PATCH] [RFC] git-p4: improve encoding handling to support inconsistent encodings Tao Klerks via GitGitGadget
2022-04-13  6:24 ` [PATCH v2] " Tao Klerks via GitGitGadget
2022-04-13 13:59   ` Ævar Arnfjörð Bjarmason
2022-04-13 15:18     ` Tao Klerks
2022-04-13 18:52       ` Ævar Arnfjörð Bjarmason [this message]
2022-04-14  9:38         ` Tao Klerks
2022-04-13 20:41   ` Andrew Oakley
2022-04-14  9:57     ` Tao Klerks
2022-04-17 18:11       ` Andrew Oakley
2022-04-19 20:30         ` Tao Klerks
2022-04-19 20:19   ` [PATCH v3] " Tao Klerks via GitGitGadget
2022-04-19 20:33     ` Tao Klerks
2022-04-30 19:26     ` [PATCH v4] " Tao Klerks via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=220413.86sfqgerf7.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=tao@klerks.biz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).