From: Thomas Guyot <tguyot@gmail.com>
To: Johannes Sixt <j6t@kdbg.org>, CH <ch-and-git.vger.kernel.org@ch.pkts.ca>
Cc: git@vger.kernel.org
Subject: Re: Feature request: better error messages when UTF-8 bites
Date: Thu, 28 Jul 2022 05:40:42 -0400
Message-ID: <1e454493-ca57-ca2e-7d82-7333a769817e@gmail.com>
In-Reply-To: <4b09bf98-dae2-491e-9858-801a9bcdd2fa@kdbg.org>
On 2022-07-28 01:42, Johannes Sixt wrote:
> Am 27.07.22 um 22:21 schrieb CH:
>> Somehow when copying and pasting a commit from a website to the command
>> line, a UTF-8 Byte Order Mark (BOM)
>> [https://en.wikipedia.org/wiki/Byte_order_mark] was appended to one of
>> the commit ids. BOMs are invisible, as are many other UTF-8 code
>> points. The upshot was that Git didn't like it, and complained bitterly:
>>
>>> $ strace -etrace=execve -s 200 git diff
>>> 038179704f0066aa815d5429221cf381ff4ef289
>>> 47346a462d8ba40b9a8b073e351c362522c46aa6
>>>
>>> execve("/usr/bin/git", ["git", "diff",
>>> "038179704f0066aa815d5429221cf381ff4ef289\357\273\277",
>>> "47346a462d8ba40b9a8b073e351c362522c46aa6"], 0x7fffec3c4bb0 /* 80 vars
>>> */) = 0
>>>
>>> fatal: ambiguous argument '038179704f0066aa815d5429221cf381ff4ef289':
>>> unknown revision or path not in the working tree.
>>> Use '--' to separate paths from revisions, like this:
>>> 'git <command> [<revision>...] -- [<file>...]'
>>> +++ exited with 128 +++
>> Feature request:
>> ================
>>
>> When printing the "fatal: ambiguous argument '......': ....", perhaps
>> escape (url or otherwise) the ambiguous argument when printing it in the
>> error message, or maybe add a sentence about non-ASCII characters being
>> found.
> That's not going to fly, IMHO, because when I type
>
> git diff todo/René
>
> I would not want to see
>
> fatal: ambiguous argument 'todo/Ren\303\251': unknown ...
This is actually already MUCH better than the OP's example. In his
example he has a string that looks like a 40-char hash, and git
complains without showing any of the Unicode gibberish attached to that
SHA-1. It would be better if the error message at least printed the
argument in ASCII with the non-ASCII bytes escaped.
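To illustrate the kind of output I mean, here is a rough shell sketch.
The helper names show_bytes and has_non_ascii are made up for this
example and this is not something git does today; it only shows that
dumping the argument byte by byte makes an invisible BOM obvious:

```shell
# Hypothetical helper, not part of git: dump an argument byte by byte
# so that non-ASCII bytes show up as octal escapes.
show_bytes() {
    printf '%s' "$1" | LC_ALL=C od -An -c
}

# Hypothetical check: succeeds if the argument contains any byte
# outside printable ASCII (0x20-0x7e).
has_non_ascii() {
    [ -n "$(printf '%s' "$1" | LC_ALL=C tr -d ' -~')" ]
}

# The OP's first commit id with a trailing UTF-8 BOM (bytes EF BB BF).
arg=$(printf '038179704f0066aa815d5429221cf381ff4ef289\357\273\277')

if has_non_ascii "$arg"; then
    echo "argument contains non-ASCII bytes:"
    show_bytes "$arg"
fi
```

The od dump ends with the three octal values 357 273 277, the same
bytes that strace showed in the OP's execve line.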
Moreover this isn't even close to the issue above - he's talking about a
no-op, non-printing Unicode marker that crept in. While I do think it
shouldn't be an issue, it shouldn't even have been passed to git. IMHO
it should have been stripped by the browser itself on copy, or by the
terminal on paste... FWIW I'm using rxvt-unicode: copying this text from
the terminal doesn't copy the marker, but a marker copied from Chrome
and pasted into the terminal is passed on to bash and git.
NB: I also thought about having the shell handle it, but the BOM isn't
even really a character, so it's not technically suitable for $IFS, and
even if we considered that option it wouldn't play well with POSIX's
definition of $IFS - how would it tell, for example, a single Unicode
codepoint from a list of binary characters? There is just no definition
of wide chars for $IFS, neither in POSIX nor in recent versions of Bash
AFAIK.
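For what it's worth, a wrapper can work around this without touching
$IFS. A minimal sketch, assuming a POSIX-style shell; the function name
strip_bom is hypothetical, and it only handles a BOM at the very start
or end of a single argument:

```shell
# Hypothetical workaround: remove a UTF-8 BOM (bytes EF BB BF) from the
# start or end of a single argument and print the result.
strip_bom() {
    _bom=$(printf '\357\273\277')
    _s=$1
    _s=${_s#"$_bom"}    # drop a leading BOM, if any
    _s=${_s%"$_bom"}    # drop a trailing BOM, if any
    printf '%s\n' "$_s"
}

# The OP's first commit id as it was pasted, BOM included.
pasted=$(printf '038179704f0066aa815d5429221cf381ff4ef289\357\273\277')
clean=$(strip_bom "$pasted")

# Byte counts before and after: 43 bytes pasted, 40 once stripped.
printf '%s' "$pasted" | wc -c
printf '%s' "$clean" | wc -c
```

One could then run something like git diff "$(strip_bom "$a")"
"$(strip_bom "$b")", but that only papers over the problem on the
caller's side.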
TL;DR: the issue is IMHO on the browser side, which shouldn't include
the marker in the copied text, or maybe on the terminal, BUT when passed
on to git it should at least print the escaped Unicode chars in the
error, otherwise it's just too confusing for the user.
BTW you actually raise another issue - I do think for file paths git
could either recompose (NFC) or decompose (NFD) the strings on storage
and comparison (which should probably be an option). The current
default for 2.30.2 is to treat them and print them as binary (escaped on
print). Consider the following when using core.quotePath=false:
$ touch "nfc_$(printf '\xc3\xb4')"
$ touch "nfd_$(printf '\x6f\xcc\x82')"
$ git add nf[cd]*
$ git status
On branch test
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: nfc_ô
new file: nfd_ô
I'm not sure how the Unicode will be rendered here; it might depend on
the mail client, if the bytes even get sent as-is, but both lines show
the exact same file name, one in NFC and one in NFD form.
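In case the mail client mangles the names, the difference can also be
shown on the raw bytes. A small sketch, assuming UTF-8 encoding, where
the precomposed form (U+00F4) is the bytes c3 b4 and the decomposed form
is "o" followed by the combining circumflex (U+0302); the variable names
are only for illustration:

```shell
nfc=$(printf '\303\264')     # U+00F4, precomposed
nfd=$(printf 'o\314\202')    # U+006F followed by U+0302

# The shell, like git, compares the raw bytes without normalizing.
if [ "$nfc" = "$nfd" ]; then
    echo "same bytes"
else
    echo "different bytes"
fi

printf '%s' "$nfc" | od -An -tx1   # bytes c3 b4
printf '%s' "$nfd" | od -An -tx1   # bytes 6f cc 82
```

The two strings render identically but compare as different, which is
why both files end up in the index side by side.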
Both are canonically equivalent and reversible. It appears macOS already
decomposes (NFD?) filenames by default, and git provides an option to
recompose the characters (core.precomposeUnicode) which, according to
the manual, is not even usable on Linux...
More on Unicode normalization: https://unicode.org/reports/tr15/
--
Thomas