git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Re: b4: unicode control characters -- warn or remove?
       [not found] <20211101175020.5r4cwmy4qppi7dis@meerkat.local>
@ 2021-11-01 19:09 ` Eric Wong
  2021-11-01 19:17   ` Konstantin Ryabitsev
  2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 7+ messages in thread
From: Eric Wong @ 2021-11-01 19:09 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: users, tools, git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hi, all:
> 
> Per exhibit a, what should we do in the situation where we discover unicode
> control characters in an email?
> 
> 1. Warn and strip these chars out, because they are extremely unlikely to be
>    doing anything legitimate in the context of a patch (unless someone is
>    sending patches for docs actually written in RTL languages)
> 2. Warn and error out, refusing to produce an mbox
> 3. Just warn and produce an mbox anyway
> 
> I'd normally do #3, but with many people piping things to git-am, I'm not sure
> if it's the safest choice.
> 
> Exibit a: https://lwn.net/Articles/874546/

+Cc: git@vger

IMHO, defense for this belongs in git-am (which already checks
things like whitespace).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 19:09 ` b4: unicode control characters -- warn or remove? Eric Wong
@ 2021-11-01 19:17   ` Konstantin Ryabitsev
  2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 7+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 19:17 UTC (permalink / raw)
  To: Eric Wong; +Cc: users, tools, git

On Mon, Nov 01, 2021 at 07:09:05PM +0000, Eric Wong wrote:
> > Per exhibit a, what should we do in the situation where we discover unicode
> > control characters in an email?
> > 
> > 1. Warn and strip these chars out, because they are extremely unlikely to be
> >    doing anything legitimate in the context of a patch (unless someone is
> >    sending patches for docs actually written in RTL languages)
> > 2. Warn and error out, refusing to produce an mbox
> > 3. Just warn and produce an mbox anyway
> > 
> > I'd normally do #3, but with many people piping things to git-am, I'm not sure
> > if it's the safest choice.
> > 
> > Exibit a: https://lwn.net/Articles/874546/
> 
> +Cc: git@vger
> 
> IMHO, defense for this belongs in git-am (which already checks
> things like whitespace).

I agree, but even if that is implemented in git, we'll still probably want to
catch this on the b4 side of things until everyone uses the git client where
that's handled natively.

-K

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 19:09 ` b4: unicode control characters -- warn or remove? Eric Wong
  2021-11-01 19:17   ` Konstantin Ryabitsev
@ 2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
  2021-11-01 20:22     ` Konstantin Ryabitsev
  1 sibling, 1 reply; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-01 20:02 UTC (permalink / raw)
  To: Eric Wong; +Cc: Konstantin Ryabitsev, users, tools, git


On Mon, Nov 01 2021, Eric Wong wrote:

> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>> Hi, all:
>> 
>> Per exhibit a, what should we do in the situation where we discover unicode
>> control characters in an email?
>> 
>> 1. Warn and strip these chars out, because they are extremely unlikely to be
>>    doing anything legitimate in the context of a patch (unless someone is
>>    sending patches for docs actually written in RTL languages)
>> 2. Warn and error out, refusing to produce an mbox
>> 3. Just warn and produce an mbox anyway
>> 
>> I'd normally do #3, but with many people piping things to git-am, I'm not sure
>> if it's the safest choice.
>> 
>> Exibit a: https://lwn.net/Articles/874546/
>
> +Cc: git@vger
>
> IMHO, defense for this belongs in git-am (which already checks
> things like whitespace).

It checks whitespace because that's something that's commonly a source
of patch corruption. I'm not adverse to adding this to core.whitespace,
but trying to catch malicious injected code seems like a rather big
expansion of its scope, particularly since:

    "[...]sending patches for docs actually written in RTL languages[...]"

Or just code? People write comment and even in their native languages,
and not all projects are as anglo-centric as those hosted on kernel.org.

I haven't checked what the overlap is between solving this issue & i18n
support, but we definitely should not be assuming that git's only using
by kernel.org users & similar, even something as relatively obscure as
git-am.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
@ 2021-11-01 20:22     ` Konstantin Ryabitsev
  2021-11-01 20:49       ` Pavel Machek
  2021-11-02 14:09       ` Konstantin Ryabitsev
  0 siblings, 2 replies; 7+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 20:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, users, tools, git

On Mon, Nov 01, 2021 at 09:02:34PM +0100, Ævar Arnfjörð Bjarmason wrote:
> It checks whitespace because that's something that's commonly a source
> of patch corruption. I'm not adverse to adding this to core.whitespace,
> but trying to catch malicious injected code seems like a rather big
> expansion of its scope, particularly since:
> 
>     "[...]sending patches for docs actually written in RTL languages[...]"
> 
> Or just code? People write comment and even in their native languages,
> and not all projects are as anglo-centric as those hosted on kernel.org.

My comment about docs was purely within the scope of the Linux kernel.

I think the following would be a sane check:

1. are there unicode control characters (CCs) present?
2. are there other characters from RTL languages present in the same line?

if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
true, then it's likely worth a warning.

Maybe even relax #2 to just check for unicode characters above a certain
barrier where RTL languages live. I think everyone will agree that if there
are unicode CCs and no other unicode characters in that same line, it's likely
not a legitimate use of control characters.

-K

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:22     ` Konstantin Ryabitsev
@ 2021-11-01 20:49       ` Pavel Machek
  2021-11-01 21:02         ` Konstantin Ryabitsev
  2021-11-02 14:09       ` Konstantin Ryabitsev
  1 sibling, 1 reply; 7+ messages in thread
From: Pavel Machek @ 2021-11-01 20:49 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Ævar Arnfjörð Bjarmason, Eric Wong, users, tools,
	git

[-- Attachment #1: Type: text/plain, Size: 1429 bytes --]

Hi!

> > It checks whitespace because that's something that's commonly a source
> > of patch corruption. I'm not adverse to adding this to core.whitespace,
> > but trying to catch malicious injected code seems like a rather big
> > expansion of its scope, particularly since:
> > 
> >     "[...]sending patches for docs actually written in RTL languages[...]"
> > 
> > Or just code? People write comment and even in their native languages,
> > and not all projects are as anglo-centric as those hosted on kernel.org.
> 
> My comment about docs was purely within the scope of the Linux kernel.
> 
> I think the following would be a sane check:
> 
> 1. are there unicode control characters (CCs) present?
> 2. are there other characters from RTL languages present in the same line?
> 
> if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> true, then it's likely worth a warning.
> 
> Maybe even relax #2 to just check for unicode characters above a certain
> barrier where RTL languages live. I think everyone will agree that if there
> are unicode CCs and no other unicode characters in that same line, it's likely
> not a legitimate use of control characters.

If you are worried about malicious patches, then it should be easy for
attackers to add some RTL characters and escape the check...

Best regards,
								Pavel
-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:49       ` Pavel Machek
@ 2021-11-01 21:02         ` Konstantin Ryabitsev
  0 siblings, 0 replies; 7+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 21:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ævar Arnfjörð Bjarmason, Eric Wong, users, tools,
	git

[-- Attachment #1: Type: text/plain, Size: 1444 bytes --]

On Mon, Nov 01, 2021 at 09:49:14PM +0100, Pavel Machek wrote:
> > I think the following would be a sane check:
> > 
> > 1. are there unicode control characters (CCs) present?
> > 2. are there other characters from RTL languages present in the same line?
> > 
> > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> > true, then it's likely worth a warning.
> > 
> > Maybe even relax #2 to just check for unicode characters above a certain
> > barrier where RTL languages live. I think everyone will agree that if there
> > are unicode CCs and no other unicode characters in that same line, it's likely
> > not a legitimate use of control characters.
> 
> If you are worried about malicious patches, then it should be easy for
> attackers to add some RTL characters and escape the check...

Well, the point of this attack was to trick the reviewer into accepting code
that the compiler would treat differently (e.g. something that looked to be
inside a comment block is actually outside of it).

So, if attackers include some actual RTL text, then the reviewer would no
longer be (as easily) tricked because there would be stuff other than just
invisible characters in the line of code.

This actually similar to how we treat unicode domains. Most browsers only
allow unicode domains when the entire domain name consists of unicode
characters. I suggest we take a similar approach.

-K

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:22     ` Konstantin Ryabitsev
  2021-11-01 20:49       ` Pavel Machek
@ 2021-11-02 14:09       ` Konstantin Ryabitsev
  1 sibling, 0 replies; 7+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-02 14:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, users, tools, git

On Mon, Nov 01, 2021 at 04:22:20PM -0400, Konstantin Ryabitsev wrote:
> I think the following would be a sane check:
> 
> 1. are there unicode control characters (CCs) present?
> 2. are there other characters from RTL languages present in the same line?
> 
> if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> true, then it's likely worth a warning.

I implemented this solution in b4 master, so it should error out only when it
finds control characters without any "other letter" unicode character category
present in the same line (where Hebrew, Arabic, etc live). There's probably
still a way to take advantage of this, but hopefully it's a lot less trivial
now and less likely to go unnoticed by the reviewer.

The error message will point where it found the problem:

	WARNING: Message contains suspicious unicode control characters!
			 Subject: [PATCH 1/2] SPI: Add SPI driver for Sunplus SP7021
				Line: + /* ‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
				-----------^
				Char: RIGHT-TO-LEFT OVERRIDE (0x202e)
			 If you are sure about this, rerun with the right flag to allow.

One can rerun with --allow-unicode-control-chars to override this.

-K

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-11-02 14:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20211101175020.5r4cwmy4qppi7dis@meerkat.local>
2021-11-01 19:09 ` b4: unicode control characters -- warn or remove? Eric Wong
2021-11-01 19:17   ` Konstantin Ryabitsev
2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
2021-11-01 20:22     ` Konstantin Ryabitsev
2021-11-01 20:49       ` Pavel Machek
2021-11-01 21:02         ` Konstantin Ryabitsev
2021-11-02 14:09       ` Konstantin Ryabitsev

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).