* [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git @ 2022-04-04 5:50 Tao Klerks via GitGitGadget 0 siblings, 0 replies; 7+ messages in thread From: Tao Klerks via GitGitGadget @ 2022-04-04 5:50 UTC (permalink / raw) To: git; +Cc: Tao Klerks, Tao Klerks From: Tao Klerks <tao@klerks.biz> Perforce has a file type "utf8" which represents a text file with explicit BOM. utf8-encoded files *without* BOM are stored as regular file type "text". The "utf8" file type behaves like text in all but one important way: it is stored, internally, without the leading 3 BOM bytes. git-p4 has historically imported utf8-with-BOM files (files stored, in Perforce, as type "utf8") the same way as regular text files - losing the BOM in the process. Under most circumstances this issue has little functional impact, as most systems consider the BOM to be optional and redundant, but this *is* a correctness failure, and can have lead to practical issues for example when BOMs are explicitly included in test files, for example in a file encoding test suite. Fix the handling of utf8-with-BOM files when importing changes from p4 to git, and introduce a test that checks it is working correctly. Signed-off-by: Tao Klerks <tao@klerks.biz> --- git-p4: preserve utf8 BOM when importing from p4 to git I manually tested these changes with python2 and python3 - I don't know whether there is a more rigorous approach possible than changing the system default python and rerunning the "t98xx" tests, but that did work (and initially highlighted an issue, now fixed). Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1203%2FTaoK%2Fgit-p4-utf8-bom-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1203/TaoK/git-p4-utf8-bom-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/1203 git-p4.py | 10 ++++++++++ t/t9802-git-p4-filetype.sh | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 44 insertions(+) diff --git a/git-p4.py b/git-p4.py index a9b1f904410..6d932e7ed76 100755 --- a/git-p4.py +++ b/git-p4.py @@ -2885,6 +2885,16 @@ class P4Sync(Command, P4UserMap): print("\nIgnoring apple filetype file %s" % file['depotFile']) return + if type_base == "utf8": + # The type utf8 explicitly means utf8 *with BOM*. These are + # streamed just like regular text files, however, without + # the BOM in the stream. + # Therefore, to accurately import these files into git, we + # need to explicitly re-add the BOM before writing. + # 'contents' is a set of bytes in this case, so create the + # BOM prefix as a b'' literal. + contents = [b'\xef\xbb\xbf' + contents[0]] + contents[1:] + # Note that we do not try to de-mangle keywords on utf16 files, # even though in theory somebody may want that. regexp = p4_keywords_regexp_for_type(type_base, type_mods) diff --git a/t/t9802-git-p4-filetype.sh b/t/t9802-git-p4-filetype.sh index 19073c6e9f8..2a6ee2a4678 100755 --- a/t/t9802-git-p4-filetype.sh +++ b/t/t9802-git-p4-filetype.sh @@ -333,4 +333,38 @@ test_expect_success SYMLINKS 'empty symlink target' ' ) ' +test_expect_success SYMLINKS 'utf-8 with and without BOM in text file' ' + ( + cd "$cli" && + + # some utf8 content + echo some tǣxt >utf8-nobom-test && + + # same utf8 content as before but with bom + echo some tǣxt | sed '\''s/^/\xef\xbb\xbf/'\'' >utf8-bom-test && + + # bom only + dd bs=1 count=3 if=utf8-bom-test of=utf8-bom-empty-test && + + p4 add utf8-nobom-test utf8-bom-test utf8-bom-empty-test && + p4 submit -d "add utf8 test files" + ) && + test_when_finished cleanup_git && + + git p4 clone --dest="$git" //depot@all && + ( + cd "$git" && + git checkout refs/remotes/p4/master && + + echo some tǣxt >utf8-nobom-check && + test_cmp utf8-nobom-check utf8-nobom-test && + + echo some tǣxt | sed '\''s/^/\xef\xbb\xbf/'\'' >utf8-bom-check && + test_cmp utf8-bom-check utf8-bom-test && + + dd bs=1 count=3 if=utf8-bom-check of=utf8-bom-empty-check && + test_cmp utf8-bom-empty-check utf8-bom-empty-test + ) +' + test_done base-commit: 4b6846d9dcd391164b72bd70e8a0c0e09776afe3 -- gitgitgadget ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
@ 2022-12-14 6:10 Tzadik Vanderhoof
2022-12-14 8:29 ` Junio C Hamano
2022-12-14 18:24 ` Tao Klerks
0 siblings, 2 replies; 7+ messages in thread
From: Tzadik Vanderhoof @ 2022-12-14 6:10 UTC (permalink / raw)
To: Git List
Cc: Tao Klerks, Luke Diamand, Junio C Hamano, Eric Sunshine,
Ævar Arnfjörð Bjarmason, Dorgon Chang,
Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene,
Andrew Oakley
From: Tao Klerks <tao@klerks.biz>
> Perforce has a file type "utf8" which represents a text file with
> explicit BOM. utf8-encoded files *without* BOM are stored as
> regular file type "text". The "utf8" file type behaves like text
> in all but one important way: it is stored, internally, without
> the leading 3 BOM bytes.
>
> git-p4 has historically imported utf8-with-BOM files (files stored,
> in Perforce, as type "utf8") the same way as regular text files -
> losing the BOM in the process.
>
> Under most circumstances this issue has little functional impact,
> as most systems consider the BOM to be optional and redundant, but
> this *is* a correctness failure, and can have lead to practical
> issues for example when BOMs are explicitly included in test files,
> for example in a file encoding test suite.
>
> Fix the handling of utf8-with-BOM files when importing changes from
> p4 to git, and introduce a test that checks it is working correctly
I don't really understand the problem that this patch addresses, but I
feel it was incorrect to modify the behavior of git-p4 in this way.
This change has made git-p4 behave differently than the native
Perforce tools.
Perforce already has a "variable" (setting) to control this behavior.
If P4CHARSET is set to "utf8-bom", the BOM will be added to files in
the local workspace of type "utf8". If P4CHARSET is set to "utf8",
the BOM will NOT be stored in the local workspace file, even if its
type is "utf8"!
Whatever the problem was that motivated this change, it should have
been solved using the Perforce variable P4CHARSET, not by modifying
the behavior of git-p4. This new behavior has made it impossible for
me to submit changes to files of type "utf8"! Any attempt fails with
"patch does not apply" and the erroneously added BOM is the cause.
I propose rolling back the patch that introduced this behavior, and if
that is not possible, I would like to submit a patch that adds a new
setting in the "git-p4" section that will enable users to opt out of
this new behavior (for example a boolean setting "git-p4.noUtf8Bom").
--
Tzadik
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git 2022-12-14 6:10 Tzadik Vanderhoof @ 2022-12-14 8:29 ` Junio C Hamano 2022-12-14 18:24 ` Tao Klerks 1 sibling, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2022-12-14 8:29 UTC (permalink / raw) To: Tzadik Vanderhoof Cc: Git List, Tao Klerks, Luke Diamand, Eric Sunshine, Ævar Arnfjörð Bjarmason, Dorgon Chang, Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene, Andrew Oakley Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes: > I propose rolling back the patch that introduced this behavior, and if > that is not possible, I would like to submit a patch that adds a new > setting in the "git-p4" section that will enable users to opt out of > this new behavior (for example a boolean setting "git-p4.noUtf8Bom"). If a change in the most recent broke P4 users who have happily been using "git p4", then we definitely should do both. Immediately revert in preparation for Git 2.39.1 or Git 2.39.2, while working to reintroduce it as a separate option to allow the feature to add BOM automatically to please users like Tao. But for such a resolution, this report should have come while fbe5f6b8 (git-p4: preserve utf8 BOM when importing from p4 to git, 2022-04-04) was still in flight, before it was merged for e4a4b315 (Git 2.37, 2022-06-27). We had three months to do so, and now it is way too late. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git 2022-12-14 6:10 Tzadik Vanderhoof 2022-12-14 8:29 ` Junio C Hamano @ 2022-12-14 18:24 ` Tao Klerks 2022-12-14 23:11 ` Junio C Hamano 1 sibling, 1 reply; 7+ messages in thread From: Tao Klerks @ 2022-12-14 18:24 UTC (permalink / raw) To: Tzadik Vanderhoof Cc: Git List, Luke Diamand, Junio C Hamano, Eric Sunshine, Ævar Arnfjörð Bjarmason, Dorgon Chang, Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene, Andrew Oakley On Wed, Dec 14, 2022 at 7:11 AM Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote: > > From: Tao Klerks <tao@klerks.biz> > > Fix the handling of utf8-with-BOM files when importing changes from > > p4 to git, and introduce a test that checks it is working correctly > > I don't really understand the problem that this patch addresses, but I > feel it was incorrect to modify the behavior of git-p4 in this way. > This change has made git-p4 behave differently than the native > Perforce tools. Hi Tzadik, first of all my apologies for the fact that these changes broke your flow - it was clearly not the intent to cause any users to have to change the way they do things (or, worse, make it impossible for them to work, or cause BOM-presence inconsistencies in their repos). When you say "This change has made git-p4 behave differently than the native Perforce tools", I suspect there's a misunderstanding, but you've clearly uncovered a git-p4 "workflow bug" that this change introduced. The git-p4 tooling can be and is used for a number of different things. If I understand your usecase correctly, you keep a local git repo that you work in, and others around you continue to use p4; you (presumably) don't share your git repo with any other users. Under those single-user circumstances, it makes sense that you expect your git repo to contain "your workspace's understanding" of the perforce repo/contents, rather than "the actual contents of the repo" - and in fact, if I understand correctly, the fact that your workspace has a specific view on what's in the repo (does UTF-8 BOM stripping) means that the git-p4 "submit"/"commit" command finds unexpected differences between your git files and your p4 workspace files, when trying to submit git changes into your p4 workspace. In other use-cases where the git repo is shared between multiple users, or where a perforce-to-git migration is facilitated by this tooling, it is critical that the result of importing a perforce repo/branch into git be "faithful", retaining the actual contents of UTF-8-BOM files - including those first 3 bytes that are ignored by many text-file-processing systems, but significant to others. > > Perforce already has a "variable" (setting) to control this behavior. > If P4CHARSET is set to "utf8-bom", the BOM will be added to files in > the local workspace of type "utf8". If P4CHARSET is set to "utf8", > the BOM will NOT be stored in the local workspace file, even if its > type is "utf8"! This behavior makes sense for workstation tooling like the "p4" tool, but to the best of my knowledge it does not make sense for "repo cloning" tooling. Unless I'm misunderstanding something, git-p4 was working as you wanted before these changes, precisely *because* you had specified P4CHARSET=utf8 in your workspace (and you're the only one using your git repo, or all others set P4CHARSET the same way and have the same BOMless perspective in their worktrees), so git-p4's buggy clone/sync behavior (swallowing UTF-8 BOMs when importing repo history from perforce) matched your p4 workspace's intentionally configured behavior. There is a complicating factor here that may be clouding my understanding: Perforce has two server modes of operation, "unicode enabled server" and... non-unicode-enabled? Normal? The idea that there are utf-8 files stored without BOM (and stored, in fact, as type "text"), and utf-8 files stored with BOM (as type "utf8") is a notion of the not-unicode-enabled Perforce servers. It may be that no such distinction exists / can be stored in "unicode enabled servers" - I need to do some testing with such servers to better understand how this works. As far as I understand, you must be using a "unicode enabled server" as the p4 docs say P4CHARSET must always be unset / set to "none" on "normal" servers (and must always be set, with "unicode enabled servers", even if implicitly to "auto"). > > Whatever the problem was that motivated this change, it should have > been solved using the Perforce variable P4CHARSET, not by modifying > the behavior of git-p4. Obviously if the conflict with workflows like yours had been understood, the change would not have been made as it was. However, I'm not sure that I understand your proposal here to solve a problem "using the Perforce variable P4CHARSET". The problem was that, on standard Perforce servers, when you add a UTF-8-BOM file, it gets stored with type "utf8". When someone else syncs that file, they get back what was put in - a UTF-8-BOM file. This is the way you generally expect/hope a VCS will typically work. Conversely, if you add a UTF-8 file that does *not* have the BOM, it will be stored as type "text" - and someone syncing it will, again, get out what was put in. This is perforce client tooling working correctly/normally. git-p4, on the other hand, does not use "p4 sync", it uses "p4 print", and with this command, the BOM is *not* included on "utf8" type files. This is not dependent on P4CHARSET, it's just an aspect of the contract of "p4 print" - it expects the recipient of text-derived files to parse the p4 "type" and handle any encoding transformations (like adding a BOM) accordingly. Again, I'm not attempting to defend the breakage - just outlining why I don't see how "using the Perforce variable P4CHARSET" would solve anything. > This new behavior has made it impossible for > me to submit changes to files of type "utf8"! Any attempt fails with > "patch does not apply" and the erroneously added BOM is the cause. I will try to understand the "unicode enabled server" behavior today or tomorrow and see what options might make sense. > I propose rolling back the patch that introduced this behavior, Junio is the expert here and has noted it's a little late for that. I obviously defer to his expertise as to git's release and backout strategy. I would like to have a go at understanding what the options are (how we can get correct and functional behavior for all users), before proposing a specific course of action. > and if > that is not possible, I would like to submit a patch that adds a new > setting in the "git-p4" section that will enable users to opt out of > this new behavior (for example a boolean setting "git-p4.noUtf8Bom"). I suspect that is the easiest tactical solution, but I'd be concerned that this would require users like you (on such unicode-enabled servers, with the P4CHARSET=utf8 setting locally), to "discover" the cause of submission errors they get, and discover the setting that would enable them to work around it. At least in the medium term, I would prefer to find a way to have git-p4 intentionally respect the "P4CHARSET" setting (with a warning about repo faithfulness if you're losing any data or causing potential repo inconsistency along the way). As a workaround for you in the immediate term (and sorry, I feel super-dirty even writing this) the only option I see for "git-p4 submit" to work would be for you to, when this happens, temporarily change your P4CHARSET value to "utf8-bom" and re-sync the affected files in your workspace, before you "git-p4 submit". I have not tested this, but given my limited understanding of your situation it seems like this would work. If you already have a different workaround please let us know what it is! Best regards, Tao ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git 2022-12-14 18:24 ` Tao Klerks @ 2022-12-14 23:11 ` Junio C Hamano 2022-12-19 9:09 ` Tao Klerks 0 siblings, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2022-12-14 23:11 UTC (permalink / raw) To: Tao Klerks Cc: Tzadik Vanderhoof, Git List, Luke Diamand, Eric Sunshine, Ævar Arnfjörð Bjarmason, Dorgon Chang, Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene, Andrew Oakley Tao Klerks <tao@klerks.biz> writes: > Again, I'm not attempting to defend the breakage - just outlining why > I don't see how "using the Perforce variable P4CHARSET" would solve > anything. > >> This new behavior has made it impossible for >> me to submit changes to files of type "utf8"! Any attempt fails with >> "patch does not apply" and the erroneously added BOM is the cause. > > I will try to understand the "unicode enabled server" behavior today > or tomorrow and see what options might make sense. > >> I propose rolling back the patch that introduced this behavior, > > Junio is the expert here and has noted it's a little late for that. I > obviously defer to his expertise as to git's release and backout > strategy. > > I would like to have a go at understanding what the options are (how > we can get correct and functional behavior for all users), before > proposing a specific course of action. It sounds like, if your conjecture turns out to be correct in that those P4 users who interact unicode enabled servers would have P4CHARSET and others don't, we may not need an extra configuration but pay attention to the P4CHARSET variable (or lack of it) and switch the behaviour. Thanks for a well reasoned response. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git 2022-12-14 23:11 ` Junio C Hamano @ 2022-12-19 9:09 ` Tao Klerks 2022-12-22 8:26 ` Tao Klerks 0 siblings, 1 reply; 7+ messages in thread From: Tao Klerks @ 2022-12-19 9:09 UTC (permalink / raw) To: Junio C Hamano Cc: Tzadik Vanderhoof, Git List, Luke Diamand, Eric Sunshine, Ævar Arnfjörð Bjarmason, Dorgon Chang, Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene, Andrew Oakley On Thu, Dec 15, 2022 at 12:11 AM Junio C Hamano <gitster@pobox.com> wrote: > > Tao Klerks <tao@klerks.biz> writes: > > > Again, I'm not attempting to defend the breakage - just outlining why > > I don't see how "using the Perforce variable P4CHARSET" would solve > > anything. > > > >> This new behavior has made it impossible for > >> me to submit changes to files of type "utf8"! Any attempt fails with > >> "patch does not apply" and the erroneously added BOM is the cause. > > > > I will try to understand the "unicode enabled server" behavior today > > or tomorrow and see what options might make sense. > > > >> I propose rolling back the patch that introduced this behavior, > > > > Junio is the expert here and has noted it's a little late for that. I > > obviously defer to his expertise as to git's release and backout > > strategy. > > > > It sounds like, if your conjecture turns out to be correct in that > those P4 users who interact unicode enabled servers would have > P4CHARSET and others don't, we may not need an extra configuration > but pay attention to the P4CHARSET variable (or lack of it) and > switch the behaviour. Yes, I suspect some sort of detection will be required. There appears to be no way to query the server for this "unicode mode" directly, but you can force the client to try connecting in the "wrong" mode for the server, and catch the corresponding error. Ugly, but effective. (the reason it's hard to just test for "P4CHARSET" is that there are several ways to set it, not just the environment, and there are multiple versions of the setting, per-connection or global; setting the global override and testing for failure is likely to be safer than attempting to understand/evaluate the hierarchy of settings) > > I would like to have a go at understanding what the options are (how > > we can get correct and functional behavior for all users), before > > proposing a specific course of action. I have finally managed to start testing with the "unicode enabled server" behavior. So far I've learned that: * Some of our tests around file content encoding handling do fail with the server in this mode (not necessarily because we're doing the wrong thing, but because the server's behavior doesn't match our expectations) these failures may correspond to bugs to be fixed, or tests to be adjusted to match appropriate expectations in this "unicode enabled mode" * Our tests around "git p4 submit" *don't* seem to fail, even on utf-8-bom files - so I have not yet reproduced Tzadik's issue (I keep placing "unicode enabled server" in quotes because I don't want to give the impression that perforce in "normal" mode doesn't handle unicode content - it absolutely does, but... differently.) I definitely need to keep testing around this to understand what the right thing to do for Tzadik (and others like him of course) might be. Tzadik, could you provide any more detail about the failing situation? One piece of info that might be particularly helpful is *what is the exact/full p4 FileType of the problem file?* Thanks, Tao ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git 2022-12-19 9:09 ` Tao Klerks @ 2022-12-22 8:26 ` Tao Klerks 0 siblings, 0 replies; 7+ messages in thread From: Tao Klerks @ 2022-12-22 8:26 UTC (permalink / raw) To: Tzadik Vanderhoof Cc: Junio C Hamano, Git List, Luke Diamand, Eric Sunshine, Ævar Arnfjörð Bjarmason, Dorgon Chang, Joachim Kuebart, Daniel Levin, Johannes Schindelin, Ben Keene, Andrew Oakley On Mon, Dec 19, 2022 at 10:09 AM Tao Klerks <tao@klerks.biz> wrote: > > I definitely need to keep testing around this to understand what the > right thing to do for Tzadik (and others like him of course) might be. > > Tzadik, could you provide any more detail about the failing situation? > One piece of info that might be particularly helpful is *what is the > exact/full p4 FileType of the problem file?* Hi Tzadik, over the past few days I've been making progress towards understanding the behavior of unicode-enabled perforce servers (and p4 client tooling against such servers): * Perforce always stores/adds new files containing a utf8 bom as type "utf8", and always syncs them to the workspace with a BOM, regardless of the P4CHARSET env variable. Same goes for utf16 with bom being stored as type "utf16". * The only filetype that gets synced to the workspace differently depending on P4CHARSET is the "unicode" type - a type exclusive to *specific files* on unicode-enabled servers * The P4CHARSET env var is used in three situations that I've detected/understood so far: 1. It is used when "p4 add"ing new files - if the content of the file is valid/parseable according to the P4CHARSET, and is not a built-in more-specific type like utf8-with-bom (perforce type "utf8") or utf16-with-bom (perforce type "utf16"), then it gets parsed according to that charset and stored as type "unicode". If the content is not parseable according to that charset, nor the more specific types, then it gets stored as "text" or "binary". 2. It is used when "p4 sync"ing files of type "unicode" to the workspace. All other types are synced in a "standard" way, only type "unicode" is sensitive to the charset config during "sync". 3. It is used when "p4 print"ing files of type "unicode", "utf8", and "utf16" (unless P4COMMANDCHARSET overrides, or a "-o" output file is specified). This last behavior spells trouble for git-p4... * When a file is stored as type "unicode", the way it gets synchronized to the workspace always depends on the P4CHARSET env variable. When a "with-bom" charset value is configured, like "utf8-bom" or "utf16", the resulting files on disk / in the workspace end up indistinguishable from the corresponding perforce native with-bom type "utf8" or "utf16". * If an output file (-o) is specified, "p4 print" behaves like "p4 sync" - using P4CHARSET only for file type "unicode". * If no output file (-o) is specified, the output of "p4 print" depends on the P4CHARSET env variable or the more-specific P4COMMANDCHARSET if specified, but with a number of provisos: ** "p4 print" will refuse to print using a "wide" charset like "utf16"; it will fall back to utf8. ** If you configure P4CHARSET (and/or P4COMMANDCHARSET) to a charset value with bom ("utf8-bom" specifically, I guess), you don't in fact get the BOM in the "p4 print" output. ** If you configure P4CHARSET (and/or P4COMMANDCHARSET) to a charset value that cannot express the specific contents of a utf8, utf16 or unicode file, "p4 print" will fail. * The behavior of "p4 print" matters in git-p4 because this is the p4 command used to import files; for "utf16" files a "file-targeting" form is used (-o), and for all other file types, the regular form. * One other "weirdness" in all this is that if you remove the BOM from a "utf8" file and submit, this change is silently swallowed; your workspace copy of the file remains without BOM, but if you force-sync the file, the BOM magically reappears. This is all further complicated by the fact that older Perforce server (and client?) versions behaved differently! (eg see https://stackoverflow.com/questions/38320814/bom-issue-in-unicode-perforce-server). So far in my testing of git-p4 against a unicode server, I've reproduced two concrete "problem workflows" specific to this server configuration: 1: If you have P4CHARSET configured to "utf8-bom" * "git p4 clone" imports your "unicode" files without BOM (because "p4 print" always omits the BOM) * "p4 sync" syncs your "unicode" files *with* BOM - because that's what P4CHARSET is telling it to do * "git p4 submit" fails to apply (some) git changes (to "unicode" files) because the git-workspace files do not have a BOM, and the p4 workspace files do. 2: If you have P4CHARSET configured to a non-utf8 encoding like "winansi": * "git p4 clone" imports your "utf8" files as winansi-encoded (because "p4 print" respects the configured P4CHARSET), and adds a utf8 bom (because of the fix earlier this year) * "p4 sync" syncs your "utf8" files encoded as utf8-with-BOM - because that's the only and correct behavior * Some of these files in the git repo may not open correctly in any given editor, if they have a utf8 bom but are not actually utf8-encoded (if they contain characters in the 128+ codepoint range) * "git p4 submit" fails to apply (some) git changes (to these "utf8" files) because these git-workspace files can have different bytes to the utf8-encoded files in the workspace. There seem to be two corresponding "fundamental bugs" in "git-p4 clone": * Files of type "unicode" are imported according to the client-configured charset (with some inconsistency around wide charsets like utf16) - correct behavior is hard to define, as perforce treats these files as having no "fundamental" encoding, but current behavior is clearly buggy. It's unclear whether properly/consistently respecting the p4-client-configured encoding makes more sense (to support local-git-repo workflows transparently) by default, or whether it makes more to use "utf8" by default, to have a consistent and correct representation of the Perforce repository contents in git by default (and then the equivalent of P4CHARSET behavior in individual/personal git repos would need to be implemented using git smudge and clean filters, I assume, for "git p4 submit" to work against non-utf8-configured p4 workspaces). * If unicode mode is enabled and P4CHARSET is not set to "utf8" or "utf8-bom", files of perforce type "utf8" will be imported according to the client-configured charset instead of the actual/internal (and p4-workspace-respected) encoding - and additionally, with "my fix", they will still get a utf8 BOM prepended, which can result in corrupt files in the git repo. Perplexingly, most of the issues I've identified with unicode-mode perforce servers are not caused by my changes: the only *new* problem this year is that, if P4CHARSET is set to something other than "utf8" or "utf8-bom", files of p4 type "utf8" will now not only be imported into the git repo with the *wrong* encoding (prior behavior), but will additionally have a utf8 BOM added, resulting in *corrupted* files (if/when the charset encoding diverges, eg if the content includes characters in the 128+ codepoint range for P4CHARSET=winansi). It's worth noting there's also an existing workflow failure that applies equally to unicode-mode and regular servers, when users delete the BOM from existing "utf8" files in their git repo and submit that change into perforce: * "git p4 clone" imports your "utf8" files correctly, as utf8-with-BOM (in the case of a unicode-mode perforce server, assume "P4CHARSET" is set to "utf8" or "utf8-bom") * "p4 sync" syncs your "utf8" files encoded as utf8-with-BOM - because that's the only and correct behavior * The user accidentally or intentionally removes the BOM as part of some edits to the file in the git workspace * "git p4 submit" applies the change to the p4 workspace successfully, it is committed to perforce (but perforce ignores the BOM removal bit) * "git p4 submit" does its rebase of the git branch on top of the now-submitted-and-reimported p4 changes, but the rebase fails because the changes are different: in the p4 history there was no BOM-removal. Tzadik, could you please confirm: * Your p4 client and server versions ("p4 info") * The perforce "types" of files you are concerned with (check with "p4 files <PATH>" or in the p4v GUI) * Your OS * Your P4CHARSET value, if set explicitly to something other than "auto" I will keep exploring how this all works, and proposing fixes to git-p4's vulnerable use of "p4 print". The correct outcome/behavior is clear (and achievable) for files of perforce type "utf8", but I'm not confident as to what the best default behavior will be for type "unicode" - I imagine it will be to keep current client-specific local-workflow-supporting behavior, fixing it to support wide encodings and BOM-specifying encodings, and recommending that P4CHARSET be set to "utf8" explicitly for canonical git-p4 clone operations / for shared repos. I have not yet looked into filepath-handling, commit-message-handling, username-handling etc on unicode-enabled servers, but I suspect similar issues will apply there: we'll have to figure out what the correct behavior would be for "canonical" git repos, whether it should be any different for "local-workflow-only" git repos, and how either desirable behavior compares to current behavior for various values of P4CHARSET. Everything probably already works correctly when P4CHARSET=utf8. For the "reappearing BOM" problem/behavior of perforce, I'm also not sure what the right approach is, even manually. I've tried asking p4 to reevaluate the "type" of the file upon edit, but can't find any way to get it to *auto*-detect the new file type. Given that I've never seen this issue described before, it doesn't feel hugely important though. I will at least add a failing test case to cover this, when I get to it. For now, based on the original problem statement/bug report, I cannot understand what specific problem is affecting Tzadik, and whether or how it was caused by the 2022 utf8 bom-preserving fix to "git-p4 clone". Thanks, Tao ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-12-22 8:26 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-04-04 5:50 [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git Tao Klerks via GitGitGadget -- strict thread matches above, loose matches on Subject: below -- 2022-12-14 6:10 Tzadik Vanderhoof 2022-12-14 8:29 ` Junio C Hamano 2022-12-14 18:24 ` Tao Klerks 2022-12-14 23:11 ` Junio C Hamano 2022-12-19 9:09 ` Tao Klerks 2022-12-22 8:26 ` Tao Klerks
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).