git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
@ 2022-04-04  5:50 Tao Klerks via GitGitGadget
  0 siblings, 0 replies; 7+ messages in thread
From: Tao Klerks via GitGitGadget @ 2022-04-04  5:50 UTC (permalink / raw)
  To: git; +Cc: Tao Klerks, Tao Klerks

From: Tao Klerks <tao@klerks.biz>

Perforce has a file type "utf8" which represents a text file with
explicit BOM. utf8-encoded files *without* BOM are stored as
regular file type "text". The "utf8" file type behaves like text
in all but one important way: it is stored, internally, without
the leading 3 BOM bytes.

git-p4 has historically imported utf8-with-BOM files (files stored,
in Perforce, as type "utf8") the same way as regular text files -
losing the BOM in the process.

Under most circumstances this issue has little functional impact,
as most systems consider the BOM to be optional and redundant, but
this *is* a correctness failure, and can have lead to practical
issues for example when BOMs are explicitly included in test files,
for example in a file encoding test suite.

Fix the handling of utf8-with-BOM files when importing changes from
p4 to git, and introduce a test that checks it is working correctly.

Signed-off-by: Tao Klerks <tao@klerks.biz>
---
    git-p4: preserve utf8 BOM when importing from p4 to git
    
    I manually tested these changes with python2 and python3 - I don't know
    whether there is a more rigorous approach possible than changing the
    system default python and rerunning the "t98xx" tests, but that did work
    (and initially highlighted an issue, now fixed).

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1203%2FTaoK%2Fgit-p4-utf8-bom-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1203/TaoK/git-p4-utf8-bom-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1203

 git-p4.py                  | 10 ++++++++++
 t/t9802-git-p4-filetype.sh | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/git-p4.py b/git-p4.py
index a9b1f904410..6d932e7ed76 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -2885,6 +2885,16 @@ class P4Sync(Command, P4UserMap):
             print("\nIgnoring apple filetype file %s" % file['depotFile'])
             return
 
+        if type_base == "utf8":
+            # The type utf8 explicitly means utf8 *with BOM*. These are
+            # streamed just like regular text files, however, without
+            # the BOM in the stream.
+            # Therefore, to accurately import these files into git, we
+            # need to explicitly re-add the BOM before writing.
+            # 'contents' is a set of bytes in this case, so create the
+            # BOM prefix as a b'' literal.
+            contents = [b'\xef\xbb\xbf' + contents[0]] + contents[1:]
+
         # Note that we do not try to de-mangle keywords on utf16 files,
         # even though in theory somebody may want that.
         regexp = p4_keywords_regexp_for_type(type_base, type_mods)
diff --git a/t/t9802-git-p4-filetype.sh b/t/t9802-git-p4-filetype.sh
index 19073c6e9f8..2a6ee2a4678 100755
--- a/t/t9802-git-p4-filetype.sh
+++ b/t/t9802-git-p4-filetype.sh
@@ -333,4 +333,38 @@ test_expect_success SYMLINKS 'empty symlink target' '
 	)
 '
 
+test_expect_success SYMLINKS 'utf-8 with and without BOM in text file' '
+	(
+		cd "$cli" &&
+
+		# some utf8 content
+		echo some tǣxt >utf8-nobom-test &&
+
+		# same utf8 content as before but with bom
+		echo some tǣxt | sed '\''s/^/\xef\xbb\xbf/'\'' >utf8-bom-test &&
+
+		# bom only
+		dd bs=1 count=3 if=utf8-bom-test of=utf8-bom-empty-test &&
+
+		p4 add utf8-nobom-test utf8-bom-test utf8-bom-empty-test &&
+		p4 submit -d "add utf8 test files"
+	) &&
+	test_when_finished cleanup_git &&
+
+	git p4 clone --dest="$git" //depot@all &&
+	(
+		cd "$git" &&
+		git checkout refs/remotes/p4/master &&
+
+		echo some tǣxt >utf8-nobom-check &&
+		test_cmp utf8-nobom-check utf8-nobom-test &&
+
+		echo some tǣxt | sed '\''s/^/\xef\xbb\xbf/'\'' >utf8-bom-check &&
+		test_cmp utf8-bom-check utf8-bom-test &&
+
+		dd bs=1 count=3 if=utf8-bom-check of=utf8-bom-empty-check &&
+		test_cmp utf8-bom-empty-check utf8-bom-empty-test
+	)
+'
+
 test_done

base-commit: 4b6846d9dcd391164b72bd70e8a0c0e09776afe3
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
@ 2022-12-14  6:10 Tzadik Vanderhoof
  2022-12-14  8:29 ` Junio C Hamano
  2022-12-14 18:24 ` Tao Klerks
  0 siblings, 2 replies; 7+ messages in thread
From: Tzadik Vanderhoof @ 2022-12-14  6:10 UTC (permalink / raw)
  To: Git List
  Cc: Tao Klerks, Luke Diamand, Junio C Hamano, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
	Andrew Oakley

From: Tao Klerks <tao@klerks.biz>
> Perforce has a file type "utf8" which represents a text file with
> explicit BOM. utf8-encoded files *without* BOM are stored as
> regular file type "text". The "utf8" file type behaves like text
> in all but one important way: it is stored, internally, without
> the leading 3 BOM bytes.
>
> git-p4 has historically imported utf8-with-BOM files (files stored,
> in Perforce, as type "utf8") the same way as regular text files -
> losing the BOM in the process.
>
> Under most circumstances this issue has little functional impact,
> as most systems consider the BOM to be optional and redundant, but
> this *is* a correctness failure, and can have lead to practical
> issues for example when BOMs are explicitly included in test files,
> for example in a file encoding test suite.
>
> Fix the handling of utf8-with-BOM files when importing changes from
> p4 to git, and introduce a test that checks it is working correctly

I don't really understand the problem that this patch addresses, but I
feel it was incorrect to modify the behavior of git-p4 in this way.
This change has made git-p4 behave differently than the native
Perforce tools.

Perforce already has a "variable" (setting) to control this behavior.
If P4CHARSET is set to "utf8-bom", the BOM will be added to files in
the local workspace of type "utf8".  If P4CHARSET is set to "utf8",
the BOM will NOT be stored in the local workspace file, even if its
type is "utf8"!

Whatever the problem was that motivated this change, it should have
been solved using the Perforce variable P4CHARSET, not by modifying
the behavior of git-p4.  This new behavior has made it impossible for
me to submit changes to files of type "utf8"!  Any attempt fails with
"patch does not apply" and the erroneously added BOM is the cause.

I propose rolling back the patch that introduced this behavior, and if
that is not possible, I would like to submit a patch that adds a new
setting in the "git-p4" section that will enable users to opt out of
this new behavior (for example a boolean setting "git-p4.noUtf8Bom").

-- 
Tzadik

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
  2022-12-14  6:10 Tzadik Vanderhoof
@ 2022-12-14  8:29 ` Junio C Hamano
  2022-12-14 18:24 ` Tao Klerks
  1 sibling, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2022-12-14  8:29 UTC (permalink / raw)
  To: Tzadik Vanderhoof
  Cc: Git List, Tao Klerks, Luke Diamand, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
	Andrew Oakley

Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

> I propose rolling back the patch that introduced this behavior, and if
> that is not possible, I would like to submit a patch that adds a new
> setting in the "git-p4" section that will enable users to opt out of
> this new behavior (for example a boolean setting "git-p4.noUtf8Bom").

If a change in the most recent broke P4 users who have happily been
using "git p4", then we definitely should do both.  Immediately
revert in preparation for Git 2.39.1 or Git 2.39.2, while working to
reintroduce it as a separate option to allow the feature to add BOM
automatically to please users like Tao.

But for such a resolution, this report should have come while
fbe5f6b8 (git-p4: preserve utf8 BOM when importing from p4 to git,
2022-04-04) was still in flight, before it was merged for e4a4b315
(Git 2.37, 2022-06-27).  We had three months to do so, and now it is
way too late.




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
  2022-12-14  6:10 Tzadik Vanderhoof
  2022-12-14  8:29 ` Junio C Hamano
@ 2022-12-14 18:24 ` Tao Klerks
  2022-12-14 23:11   ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Tao Klerks @ 2022-12-14 18:24 UTC (permalink / raw)
  To: Tzadik Vanderhoof
  Cc: Git List, Luke Diamand, Junio C Hamano, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
	Andrew Oakley

On Wed, Dec 14, 2022 at 7:11 AM Tzadik Vanderhoof
<tzadik.vanderhoof@gmail.com> wrote:
>
> From: Tao Klerks <tao@klerks.biz>
> > Fix the handling of utf8-with-BOM files when importing changes from
> > p4 to git, and introduce a test that checks it is working correctly
>
> I don't really understand the problem that this patch addresses, but I
> feel it was incorrect to modify the behavior of git-p4 in this way.
> This change has made git-p4 behave differently than the native
> Perforce tools.

Hi Tzadik, first of all my apologies for the fact that these changes
broke your flow - it was clearly not the intent to cause any users to
have to change the way they do things (or, worse, make it impossible
for them to work, or cause BOM-presence inconsistencies in their
repos).

When you say "This change has made git-p4 behave differently than the
native Perforce tools", I suspect there's a misunderstanding, but
you've clearly uncovered a git-p4 "workflow bug" that this change
introduced.

The git-p4 tooling can be and is used for a number of different things.

If I understand your usecase correctly, you keep a local git repo that
you work in, and others around you continue to use p4; you
(presumably) don't share your git repo with any other users.

Under those single-user circumstances, it makes sense that you expect
your git repo to contain "your workspace's understanding" of the
perforce repo/contents, rather than "the actual contents of the repo"
- and in fact, if I understand correctly, the fact that your workspace
has a specific view on what's in the repo (does UTF-8 BOM stripping)
means that the git-p4 "submit"/"commit" command finds unexpected
differences between your git files and your p4 workspace files, when
trying to submit git changes into your p4 workspace.

In other use-cases where the git repo is shared between multiple
users, or where a perforce-to-git migration is facilitated by this
tooling, it is critical that the result of importing a perforce
repo/branch into git be "faithful", retaining the actual contents of
UTF-8-BOM files - including those first 3 bytes that are ignored by
many text-file-processing systems, but significant to others.

>
> Perforce already has a "variable" (setting) to control this behavior.
> If P4CHARSET is set to "utf8-bom", the BOM will be added to files in
> the local workspace of type "utf8".  If P4CHARSET is set to "utf8",
> the BOM will NOT be stored in the local workspace file, even if its
> type is "utf8"!

This behavior makes sense for workstation tooling like the "p4" tool,
but to the best of my knowledge it does not make sense for "repo
cloning" tooling.

Unless I'm misunderstanding something, git-p4 was working as you
wanted before these changes, precisely *because* you had specified
P4CHARSET=utf8 in your workspace (and you're the only one using your
git repo, or all others set P4CHARSET the same way and have the same
BOMless perspective in their worktrees), so git-p4's buggy clone/sync
behavior (swallowing UTF-8 BOMs when importing repo history from
perforce) matched your p4 workspace's intentionally configured
behavior.

There is a complicating factor here that may be clouding my
understanding: Perforce has two server modes of operation, "unicode
enabled server" and... non-unicode-enabled? Normal?

The idea that there are utf-8 files stored without BOM (and stored, in
fact, as type "text"), and utf-8 files stored with BOM (as type
"utf8") is a notion of the not-unicode-enabled Perforce servers. It
may be that no such distinction exists / can be stored in "unicode
enabled servers" - I need to do some testing with such servers to
better understand how this works. As far as I understand, you must be
using a "unicode enabled server" as the p4 docs say P4CHARSET must
always be unset / set to "none" on "normal" servers (and must always
be set, with "unicode enabled servers", even if implicitly to "auto").

>
> Whatever the problem was that motivated this change, it should have
> been solved using the Perforce variable P4CHARSET, not by modifying
> the behavior of git-p4.

Obviously if the conflict with workflows like yours had been
understood, the change would not have been made as it was.

However, I'm not sure that I understand your proposal here to solve a
problem "using the Perforce variable P4CHARSET". The problem was that,
on standard Perforce servers, when you add a UTF-8-BOM file, it gets
stored with type "utf8". When someone else syncs that file, they get
back what was put in - a UTF-8-BOM file. This is the way you generally
expect/hope a VCS will typically work. Conversely, if you add a UTF-8
file that does *not* have the BOM, it will be stored as type "text" -
and someone syncing it will, again, get out what was put in. This is
perforce client tooling working correctly/normally.

git-p4, on the other hand, does not use "p4 sync", it uses "p4 print",
and with this command, the BOM is *not* included on "utf8" type files.
This is not dependent on P4CHARSET, it's just an aspect of the
contract of "p4 print" - it expects the recipient of text-derived
files to parse the p4 "type" and  handle any encoding transformations
(like adding a BOM) accordingly.

Again, I'm not attempting to defend the breakage - just outlining why
I don't see how "using the Perforce variable P4CHARSET" would solve
anything.

> This new behavior has made it impossible for
> me to submit changes to files of type "utf8"!  Any attempt fails with
> "patch does not apply" and the erroneously added BOM is the cause.

I will try to understand the "unicode enabled server" behavior today
or tomorrow and see what options might make sense.

> I propose rolling back the patch that introduced this behavior,

Junio is the expert here and has noted it's a little late for that. I
obviously defer to his expertise as to git's release and backout
strategy.

I would like to have a go at understanding what the options are (how
we can get correct and functional behavior for all users), before
proposing a specific course of action.

> and if
> that is not possible, I would like to submit a patch that adds a new
> setting in the "git-p4" section that will enable users to opt out of
> this new behavior (for example a boolean setting "git-p4.noUtf8Bom").

I suspect that is the easiest tactical solution, but I'd be concerned
that this would require users like you (on such unicode-enabled
servers, with the P4CHARSET=utf8 setting locally), to "discover" the
cause of submission errors they get, and discover the setting that
would enable them to work around it. At least in the medium term, I
would prefer to find a way to have git-p4 intentionally respect the
"P4CHARSET" setting (with a warning about repo faithfulness if you're
losing any data or causing potential repo inconsistency along the
way).

As a workaround for you in the immediate term (and sorry, I feel
super-dirty even writing this) the only option I see for "git-p4
submit" to work would be for you to, when this happens, temporarily
change your P4CHARSET value to "utf8-bom" and re-sync the affected
files in your workspace, before you "git-p4 submit". I have not tested
this, but given my limited understanding of your situation it seems
like this would work.

If you already have a different workaround please let us know what it is!

Best regards,
Tao

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
  2022-12-14 18:24 ` Tao Klerks
@ 2022-12-14 23:11   ` Junio C Hamano
  2022-12-19  9:09     ` Tao Klerks
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2022-12-14 23:11 UTC (permalink / raw)
  To: Tao Klerks
  Cc: Tzadik Vanderhoof, Git List, Luke Diamand, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
	Andrew Oakley

Tao Klerks <tao@klerks.biz> writes:

> Again, I'm not attempting to defend the breakage - just outlining why
> I don't see how "using the Perforce variable P4CHARSET" would solve
> anything.
>
>> This new behavior has made it impossible for
>> me to submit changes to files of type "utf8"!  Any attempt fails with
>> "patch does not apply" and the erroneously added BOM is the cause.
>
> I will try to understand the "unicode enabled server" behavior today
> or tomorrow and see what options might make sense.
>
>> I propose rolling back the patch that introduced this behavior,
>
> Junio is the expert here and has noted it's a little late for that. I
> obviously defer to his expertise as to git's release and backout
> strategy.
>
> I would like to have a go at understanding what the options are (how
> we can get correct and functional behavior for all users), before
> proposing a specific course of action.

It sounds like, if your conjecture turns out to be correct in that
those P4 users who interact unicode enabled servers would have
P4CHARSET and others don't, we may not need an extra configuration
but pay attention to the P4CHARSET variable (or lack of it) and
switch the behaviour.

Thanks for a well reasoned response.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
  2022-12-14 23:11   ` Junio C Hamano
@ 2022-12-19  9:09     ` Tao Klerks
  2022-12-22  8:26       ` Tao Klerks
  0 siblings, 1 reply; 7+ messages in thread
From: Tao Klerks @ 2022-12-19  9:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Tzadik Vanderhoof, Git List, Luke Diamand, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
	Andrew Oakley

On Thu, Dec 15, 2022 at 12:11 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Tao Klerks <tao@klerks.biz> writes:
>
> > Again, I'm not attempting to defend the breakage - just outlining why
> > I don't see how "using the Perforce variable P4CHARSET" would solve
> > anything.
> >
> >> This new behavior has made it impossible for
> >> me to submit changes to files of type "utf8"!  Any attempt fails with
> >> "patch does not apply" and the erroneously added BOM is the cause.
> >
> > I will try to understand the "unicode enabled server" behavior today
> > or tomorrow and see what options might make sense.
> >
> >> I propose rolling back the patch that introduced this behavior,
> >
> > Junio is the expert here and has noted it's a little late for that. I
> > obviously defer to his expertise as to git's release and backout
> > strategy.
> >
>
> It sounds like, if your conjecture turns out to be correct in that
> those P4 users who interact unicode enabled servers would have
> P4CHARSET and others don't, we may not need an extra configuration
> but pay attention to the P4CHARSET variable (or lack of it) and
> switch the behaviour.

Yes, I suspect some sort of detection will be required. There appears
to be no way to query the server for this "unicode mode" directly, but
you can force the client to try connecting in the "wrong" mode for the
server, and catch the corresponding error. Ugly, but effective.

(the reason it's hard to just test for "P4CHARSET" is that there are
several ways to set it, not just the environment, and there are
multiple versions of the setting, per-connection or global; setting
the global override and testing for failure is likely to be safer than
attempting to understand/evaluate the hierarchy of settings)

> > I would like to have a go at understanding what the options are (how
> > we can get correct and functional behavior for all users), before
> > proposing a specific course of action.

I have finally managed to start testing with the "unicode enabled
server" behavior.

So far I've learned that:
 * Some of our tests around file content encoding handling do fail
with the server in this mode (not necessarily because we're doing the
wrong thing, but because the server's behavior doesn't match our
expectations) these failures may correspond to bugs to be fixed, or
tests to be adjusted to match appropriate expectations in this
"unicode enabled mode"
 * Our tests around "git p4 submit" *don't* seem to fail, even on
utf-8-bom files - so I have not yet reproduced Tzadik's issue

(I keep placing "unicode enabled server" in quotes because I don't
want to give the impression that perforce in "normal" mode doesn't
handle unicode content - it absolutely does, but... differently.)

I definitely need to keep testing around this to understand what the
right thing to do for Tzadik (and others like him of course) might be.

Tzadik, could you provide any more detail about the failing situation?
One piece of info that might be particularly helpful is *what is the
exact/full p4 FileType of the problem file?*

Thanks,
Tao

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
  2022-12-19  9:09     ` Tao Klerks
@ 2022-12-22  8:26       ` Tao Klerks
  0 siblings, 0 replies; 7+ messages in thread
From: Tao Klerks @ 2022-12-22  8:26 UTC (permalink / raw)
  To: Tzadik Vanderhoof
  Cc: Junio C Hamano, Git List, Luke Diamand, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
	Andrew Oakley

On Mon, Dec 19, 2022 at 10:09 AM Tao Klerks <tao@klerks.biz> wrote:
>
> I definitely need to keep testing around this to understand what the
> right thing to do for Tzadik (and others like him of course) might be.
>
> Tzadik, could you provide any more detail about the failing situation?
> One piece of info that might be particularly helpful is *what is the
> exact/full p4 FileType of the problem file?*

Hi Tzadik, over the past few days I've been making progress towards
understanding the behavior of unicode-enabled perforce servers (and p4
client tooling against such servers):
* Perforce always stores/adds new files containing a utf8 bom as type
"utf8", and always syncs them to the workspace with a BOM, regardless
of the P4CHARSET env variable. Same goes for utf16 with bom being
stored as type "utf16".
* The only filetype that gets synced to the workspace differently
depending on P4CHARSET is the "unicode" type - a type exclusive to
*specific files* on unicode-enabled servers
* The P4CHARSET env var is used in three situations that I've
detected/understood so far:
  1. It is used when "p4 add"ing new files - if the content of the
file is valid/parseable according to the P4CHARSET, and is not a
built-in more-specific type like utf8-with-bom (perforce type "utf8")
or utf16-with-bom (perforce type "utf16"), then it gets parsed
according to that charset and stored as type "unicode". If the content
is not parseable according to that charset, nor the more specific
types, then it gets stored as "text" or "binary".
  2. It is used when "p4 sync"ing files of type "unicode" to the
workspace. All other types are synced in a "standard" way, only type
"unicode" is sensitive to the charset config during "sync".
  3. It is used when "p4 print"ing files of type "unicode", "utf8",
and "utf16" (unless P4COMMANDCHARSET overrides, or a "-o" output file
is specified). This last behavior spells trouble for git-p4...
* When a file is stored as type "unicode", the way it gets
synchronized to the workspace always depends on the P4CHARSET env
variable. When a "with-bom" charset value is configured, like
"utf8-bom" or "utf16", the resulting files on disk / in the workspace
end up indistinguishable from the corresponding perforce native
with-bom type "utf8" or "utf16".
* If an output file (-o) is specified, "p4 print" behaves like "p4
sync" - using P4CHARSET only for file type "unicode".
* If no output file (-o) is specified, the output of "p4 print"
depends on the P4CHARSET env variable or the more-specific
P4COMMANDCHARSET if specified, but with a number of provisos:
** "p4 print" will refuse to print using a "wide" charset like
"utf16"; it will fall back to utf8.
** If you configure P4CHARSET (and/or P4COMMANDCHARSET) to a charset
value with bom ("utf8-bom" specifically, I guess), you don't in fact
get the BOM in the "p4 print" output.
** If you configure P4CHARSET (and/or P4COMMANDCHARSET) to a charset
value that cannot express the specific contents of a utf8, utf16 or
unicode file, "p4 print" will fail.
* The behavior of "p4 print" matters in git-p4 because this is the p4
command used to import files; for "utf16" files a "file-targeting"
form is used (-o), and for all other file types, the regular form.
* One other "weirdness" in all this is that if you remove the BOM from
a "utf8" file and submit, this change is silently swallowed; your
workspace copy of the file remains without BOM, but if you force-sync
the file, the BOM magically reappears.

This is all further complicated by the fact that older Perforce server
(and client?) versions behaved differently! (eg see
https://stackoverflow.com/questions/38320814/bom-issue-in-unicode-perforce-server).

So far in my testing of git-p4 against a unicode server, I've
reproduced two concrete "problem workflows" specific to this server
configuration:

1: If you have P4CHARSET configured to "utf8-bom"
* "git p4 clone" imports your "unicode" files without BOM (because "p4
print" always omits the BOM)
* "p4 sync" syncs your "unicode" files *with* BOM - because that's
what P4CHARSET is telling it to do
* "git p4 submit" fails to apply (some) git changes (to "unicode"
files) because the git-workspace files do not have a BOM, and the p4
workspace files do.

2: If you have P4CHARSET configured to a non-utf8 encoding like "winansi":
* "git p4 clone" imports your "utf8" files as winansi-encoded (because
"p4 print" respects the configured P4CHARSET), and adds a utf8 bom
(because of the fix earlier this year)
* "p4 sync" syncs your "utf8" files encoded as utf8-with-BOM - because
that's the only and correct behavior
* Some of these files in the git repo may not open correctly in any
given editor, if they have a utf8 bom but are not actually
utf8-encoded (if they contain characters in the 128+ codepoint range)
* "git p4 submit" fails to apply (some) git changes (to these "utf8"
files) because these git-workspace files can have different bytes to
the utf8-encoded files in the workspace.

There seem to be two corresponding "fundamental bugs" in "git-p4 clone":
* Files of type "unicode" are imported according to the
client-configured charset (with some inconsistency around wide
charsets like utf16) - correct behavior is hard to define, as perforce
treats these files as having no "fundamental" encoding, but current
behavior is clearly buggy. It's unclear whether properly/consistently
respecting the p4-client-configured encoding makes more sense (to
support local-git-repo workflows transparently) by default, or whether
it makes more to use "utf8" by default, to have a consistent and
correct representation of the Perforce repository contents in git by
default (and then the equivalent of P4CHARSET behavior in
individual/personal git repos would need to be implemented using git
smudge and clean filters, I assume, for "git p4 submit" to work
against non-utf8-configured p4 workspaces).
* If unicode mode is enabled and P4CHARSET is not set to "utf8" or
"utf8-bom", files of perforce type "utf8" will be imported according
to the client-configured charset instead of the actual/internal (and
p4-workspace-respected) encoding - and additionally, with "my fix",
they will still get a utf8 BOM prepended, which can result in corrupt
files in the git repo.

Perplexingly, most of the issues I've identified with unicode-mode
perforce servers are not caused by my changes: the only *new* problem
this year is that, if P4CHARSET is set to something other than "utf8"
or "utf8-bom", files of p4 type "utf8" will now not only be imported
into the git repo with the *wrong* encoding (prior behavior), but will
additionally have a utf8 BOM added, resulting in *corrupted* files
(if/when the charset encoding diverges, eg if the content includes
characters in the 128+ codepoint range for P4CHARSET=winansi).

It's worth noting there's also an existing workflow failure that
applies equally to unicode-mode and regular servers, when users delete
the BOM from existing "utf8" files in their git repo and submit that
change into perforce:
* "git p4 clone" imports your "utf8" files correctly, as utf8-with-BOM
(in the case of a unicode-mode perforce server, assume "P4CHARSET" is
set to "utf8" or "utf8-bom")
* "p4 sync" syncs your "utf8" files encoded as utf8-with-BOM - because
that's the only and correct behavior
* The user accidentally or intentionally removes the BOM as part of
some edits to the file in the git workspace
* "git p4 submit" applies the change to the p4 workspace successfully,
it is committed to perforce (but perforce ignores the BOM removal bit)
* "git p4 submit" does its rebase of the git branch on top of the
now-submitted-and-reimported p4 changes, but the rebase fails because
the changes are different: in the p4 history there was no BOM-removal.

Tzadik, could you please confirm:
* Your p4 client and server versions ("p4 info")
* The perforce "types" of files you are concerned with (check with "p4
files <PATH>" or in the p4v GUI)
* Your OS
* Your P4CHARSET value, if set explicitly to something other than "auto"

I will keep exploring how this all works, and proposing fixes to
git-p4's vulnerable use of "p4 print". The correct outcome/behavior is
clear (and achievable) for files of perforce type "utf8", but I'm not
confident as to what the best default behavior will be for type
"unicode" - I imagine it will be to keep current client-specific
local-workflow-supporting behavior, fixing it to support wide
encodings and BOM-specifying encodings, and recommending that
P4CHARSET be set to "utf8" explicitly for canonical git-p4 clone
operations / for shared repos.

I have not yet looked into filepath-handling, commit-message-handling,
username-handling etc on unicode-enabled servers, but I suspect
similar issues will apply there: we'll have to figure out what the
correct behavior would be for "canonical" git repos, whether it should
be any different for "local-workflow-only" git repos, and how either
desirable behavior compares to current behavior for various values of
P4CHARSET. Everything probably already works correctly when
P4CHARSET=utf8.

For the "reappearing BOM" problem/behavior of perforce, I'm also not
sure what the right approach is, even manually. I've tried asking p4
to reevaluate the "type" of the file upon edit, but can't find any way
to get it to *auto*-detect the new file type. Given that I've never
seen this issue described before, it doesn't feel hugely important
though. I will at least add a failing test case to cover this, when I
get to it.

For now, based on the original problem statement/bug report, I cannot
understand what specific problem is affecting Tzadik, and whether or
how it was caused by the 2022 utf8 bom-preserving fix to "git-p4
clone".

Thanks,
Tao

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-12-22  8:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-04  5:50 [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git Tao Klerks via GitGitGadget
  -- strict thread matches above, loose matches on Subject: below --
2022-12-14  6:10 Tzadik Vanderhoof
2022-12-14  8:29 ` Junio C Hamano
2022-12-14 18:24 ` Tao Klerks
2022-12-14 23:11   ` Junio C Hamano
2022-12-19  9:09     ` Tao Klerks
2022-12-22  8:26       ` Tao Klerks

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).