git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Feature request: better error messages when UTF-8 bites
@ 2022-07-27 20:21 CH
  2022-07-28  5:42 ` Johannes Sixt
  0 siblings, 1 reply; 4+ messages in thread
From: CH @ 2022-07-27 20:21 UTC (permalink / raw)
  To: git

Hi;

Just found an annoyance in `git log` (and likely elsewhere) that may 
warrant a change:

Somehow when copying and pasting a commit from a website to the command 
line, a UTF-8 Byte Order Mark (BOM) 
[https://en.wikipedia.org/wiki/Byte_order_mark] was appended to one of 
the commit ids.  BOMs are invisible, as are many other UTF-8 code 
points.  The upshot was that Git didn't like it, and complained 
bitterly:

> $ strace -etrace=execve -s 200 git diff 
> 038179704f0066aa815d5429221cf381ff4ef289  
> 47346a462d8ba40b9a8b073e351c362522c46aa6
> 
> execve("/usr/bin/git", ["git", "diff", 
> "038179704f0066aa815d5429221cf381ff4ef289\357\273\277", 
> "47346a462d8ba40b9a8b073e351c362522c46aa6"], 0x7fffec3c4bb0 /* 80 vars 
> */) = 0
> 
> fatal: ambiguous argument '038179704f0066aa815d5429221cf381ff4ef289': 
> unknown revision or path not in the working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> +++ exited with 128 +++

Feature request:
================

When printing the "fatal: ambiguous argument '......': ....", perhaps 
escape (url or otherwise) the ambiguous argument when printing it in the 
error message, or maybe add a sentence about non-ASCII characters being 
found.

This is sort of a difficult corner-case, in that it is perfectly legal 
to have UTF-8 characters in a branch or tag name (see 
git-check-ref-format for the allowed characters), so someone could 
indeed create a branch named 
"038179704f0066aa815d5429221cf381ff4ef289\357\273\277" if they were a 
tortured soul bent on overthrowing polite society.  Rejecting input 
because it has bytes with values above \177 is therefore not a solution.

Similarly, scanning the input for invisible UTF-8 characters (or even 
invalid UTF-8 sequences) is leaning too far the other way: git should 
not be validating character encodings.  It should stay encoding-neutral, 
as the alternative leads to madness, driving developers into becoming 
tortured souls bent on rigidly enforcing polite society.  We have enough 
of those already.

It's unclear as to whether violent overthrow or rigid enforcement is the 
lesser of two evils, but let's not perform the experiment to find out.  
:-)

Cheers!

-- 
CH (ch-and-git.vger.kernel.org@ch.pkts.ca)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Feature request: better error messages when UTF-8 bites
  2022-07-27 20:21 Feature request: better error messages when UTF-8 bites CH
@ 2022-07-28  5:42 ` Johannes Sixt
  2022-07-28  9:40   ` Thomas Guyot
  0 siblings, 1 reply; 4+ messages in thread
From: Johannes Sixt @ 2022-07-28  5:42 UTC (permalink / raw)
  To: CH; +Cc: git

Am 27.07.22 um 22:21 schrieb CH:
> Somehow when copying and pasting a commit from a website to the command
> line, a UTF-8 Byte Order Mark (BOM)
> [https://en.wikipedia.org/wiki/Byte_order_mark] was appended to one of
> the commit ids.  BOMs are invisible, as are many other UTF-8 code
> points.  The upshot was that Git didn't like it, and complained bitterly:
> 
>> $ strace -etrace=execve -s 200 git diff
>> 038179704f0066aa815d5429221cf381ff4ef289 
>> 47346a462d8ba40b9a8b073e351c362522c46aa6
>>
>> execve("/usr/bin/git", ["git", "diff",
>> "038179704f0066aa815d5429221cf381ff4ef289\357\273\277",
>> "47346a462d8ba40b9a8b073e351c362522c46aa6"], 0x7fffec3c4bb0 /* 80 vars
>> */) = 0
>>
>> fatal: ambiguous argument '038179704f0066aa815d5429221cf381ff4ef289':
>> unknown revision or path not in the working tree.
>> Use '--' to separate paths from revisions, like this:
>> 'git <command> [<revision>...] -- [<file>...]'
>> +++ exited with 128 +++
> 
> Feature request:
> ================
> 
> When printing the "fatal: ambiguous argument '......': ....", perhaps
> escape (url or otherwise) the ambiguous argument when printing it in the
> error message, or maybe add a sentence about non-ASCII characters being
> found.

That's not going to fly, IMHO, because when I type

   git diff todo/René

I would not want to see

fatal: ambiguous argument 'todo/Ren\303\251': unknown ...

I'm convinced that there are thousands of users who use non-ASCII branch
and file names that they also frequently mis-type. They'd all be greeted
with unintelligible nerdy gibberish.

I may be able to change my mind if ambiguous input (in the sense of "is
not what it seems to be") leads to a security hazard that is unique to Git.

-- Hannes

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Feature request: better error messages when UTF-8 bites
  2022-07-28  5:42 ` Johannes Sixt
@ 2022-07-28  9:40   ` Thomas Guyot
  2022-07-28 18:01     ` Torsten Bögershausen
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Guyot @ 2022-07-28  9:40 UTC (permalink / raw)
  To: Johannes Sixt, CH; +Cc: git

On 2022-07-28 01:42, Johannes Sixt wrote:
> Am 27.07.22 um 22:21 schrieb CH:
>> Somehow when copying and pasting a commit from a website to the command
>> line, a UTF-8 Byte Order Mark (BOM)
>> [https://en.wikipedia.org/wiki/Byte_order_mark] was appended to one of
>> the commit ids.  BOMs are invisible, as are many other UTF-8 code
>> points.  The upshot was that Git didn't like it, and complained bitterly:
>>
>>> $ strace -etrace=execve -s 200 git diff
>>> 038179704f0066aa815d5429221cf381ff4ef289
>>> 47346a462d8ba40b9a8b073e351c362522c46aa6
>>>
>>> execve("/usr/bin/git", ["git", "diff",
>>> "038179704f0066aa815d5429221cf381ff4ef289\357\273\277",
>>> "47346a462d8ba40b9a8b073e351c362522c46aa6"], 0x7fffec3c4bb0 /* 80 vars
>>> */) = 0
>>>
>>> fatal: ambiguous argument '038179704f0066aa815d5429221cf381ff4ef289':
>>> unknown revision or path not in the working tree.
>>> Use '--' to separate paths from revisions, like this:
>>> 'git <command> [<revision>...] -- [<file>...]'
>>> +++ exited with 128 +++
>> Feature request:
>> ================
>>
>> When printing the "fatal: ambiguous argument '......': ....", perhaps
>> escape (url or otherwise) the ambiguous argument when printing it in the
>> error message, or maybe add a sentence about non-ASCII characters being
>> found.
> That's not going to fly, IMHO, because when I type
>
>     git diff todo/René
>
> I would not want to see
>
> fatal: ambiguous argument 'todo/Ren\303\251': unknown ...

This is actually already MUCH better that the OP's example. In his 
example he has a string that looks like a 40-char hash, and git 
complains without showing any of the Unicode gibberish attached to that 
sha1. It would be better if at least it printed something in ASCII with 
escaped bytes in the error message.

Moreover this isn't even close to the issue above - he's talking about a 
no-op, non-printing Unicode marker that crept in. While I do think it 
shouldn't be an issue, it shouldn't even have been passed to git. IMHO 
it should have been stripped by the browser itself on copy, or by the 
terminal on paste... FWIW I'm using rxvt-unicode, and copying this from 
the terminal doesn't copy the marker but pasting the marker copied from 
Chrome is passed on to bash and git.

NB: I also though what if the shell handled it, but that isn't even 
really a character so not technically suitable for $IFS, and even if we 
considered that option it wouldn't really play well with POSIX's 
definition of $IFS - how to tell for example between a single Unicode 
codepoint and a list of binary characters? There is just no definition 
of wide chars for $IFS, not in POSIX nor in recent versions of Bash AFAIK.


TL;DR; the issue is IMHO on the browser side, which shouldn't include 
the marker in the copied text, or maybe on the terminal, BUT when passed 
on to git it should at least print the escaped Unicode chars in the 
error, otherwise it's just too confusing for the user.


BTW you actually raise another issue - I do think for file paths git 
could either recompose (NFC) or decompose (NFD) the strings on storage 
and comparison (which should probably be an option... the current 
default for 2.30.2 is to treat them and print them as binary (escaped on 
print). Consider the following when using core.quotePath=false:

$ touch "nfc_$(printf '\xf4')"
$ touch "nfd_$(printf '\x6f\xcc\x82')"
$ git add nf[cd]*
$ git status
On branch test
Changes to be committed:
   (use "git restore --staged <file>..." to unstage)
     new file:   nfc_ô
     new file:   nfd_ô

I'm not sure how the Unicode will be translated here, it might depend on 
the mail client if they even's get sent as-is, but both shows the exact 
same file name, one in NFD and one in NFC format.

Both are canonically equivalent and reversible. It appears MacOS already 
decompose (NFD?) filenames by default and git provides an option to 
recompose the characters (core.precomposeUnicode) which, according to 
the manual, is not even usable on Linux...

More on Unicode normalization: https://unicode.org/reports/tr15/

--
Thomas

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Feature request: better error messages when UTF-8 bites
  2022-07-28  9:40   ` Thomas Guyot
@ 2022-07-28 18:01     ` Torsten Bögershausen
  0 siblings, 0 replies; 4+ messages in thread
From: Torsten Bögershausen @ 2022-07-28 18:01 UTC (permalink / raw)
  To: Thomas Guyot; +Cc: Johannes Sixt, CH, git

[]

>
> BTW you actually raise another issue - I do think for file paths git could
> either recompose (NFC) or decompose (NFD) the strings on storage and
> comparison (which should probably be an option... the current default for
> 2.30.2 is to treat them and print them as binary (escaped on print).
> Consider the following when using core.quotePath=false:
>
> $ touch "nfc_$(printf '\xf4')"

This is not valid unicode, isn't it ?
Probably we need to use the octal version, since not all
printf() implementation support hex values starting with 0x,
but all of them support octal:

auml=$(printf '\303\244')
aumlcdiar=$(printf '\141\314\210')


> $ touch "nfd_$(printf '\x6f\xcc\x82')"
> $ git add nf[cd]*
> $ git status
> On branch test
> Changes to be committed:
>   (use "git restore --staged <file>..." to unstage)
>     new file:   nfc_ô
>     new file:   nfd_ô

>
> I'm not sure how the Unicode will be translated here, it might depend on the
> mail client if they even's get sent as-is, but both shows the exact same
> file name, one in NFD and one in NFC format.

Translated by "whom" ?
Most programs do no translate anything here.
>
> Both are canonically equivalent and reversible. It appears MacOS already
> decompose (NFD?) filenames by default and git provides an option to
> recompose the characters (core.precomposeUnicode) which, according to the
> manual, is not even usable on Linux...

Yes. Technically you can have both under Linux, at least unless you
are running ZFS, which may be created unicode-aware (or not, that is the default).

But why do you want to have 2 files on disk with different normalizations ?


>
> More on Unicode normalization: https://unicode.org/reports/tr15/
>
> --
> Thomas

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-07-28 18:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-27 20:21 Feature request: better error messages when UTF-8 bites CH
2022-07-28  5:42 ` Johannes Sixt
2022-07-28  9:40   ` Thomas Guyot
2022-07-28 18:01     ` Torsten Bögershausen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).