From: Peter Krefting
Subject: Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
Date: Fri, 5 Jul 2013 13:51:03 +0100 (CET)
Organization: /universe/earth/europe/norway/oslo
To: "brian m. carlson"
Cc: Git Mailing List, gitster@pobox.com
In-Reply-To: <20130704171943.GA267700@vauxhall.crustytoothpaste.net>
brian m. carlson:

> +	/* U+FFFE and U+FFFF are guaranteed non-characters. */
> +	if ((codepoint & 0x1ffffe) == 0xfffe)
> +		return bad_offset;

I missed this the first time around: all Unicode code points whose
lower 16 bits are FFFE or FFFF are non-characters, so you can rewrite
that as:

	/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
	if ((codepoint & 0xfffe) == 0xfffe)
		return bad_offset;

The code points in the range U+FDD0--U+FDEF are also non-characters,
if you wish to be really pedantic.

  $ grep '^[0-9A-F].*
  FDD1 FDD2 FDD3 FDD4 FDD5 FDD6 FDD7 FDD8 FDD9 FDDA FDDB FDDC FDDD
  FDDE FDDF FDE0 FDE1 FDE2 FDE3 FDE4 FDE5 FDE6 FDE7 FDE8 FDE9 FDEA
  FDEB FDEC FDED FDEE FDEF
  FFFE FFFF 1FFFE 1FFFF 2FFFE 2FFFF 3FFFE 3FFFF 4FFFE 4FFFF 5FFFE
  5FFFF 6FFFE 6FFFF 7FFFE 7FFFF 8FFFE 8FFFF 9FFFE 9FFFF AFFFE AFFFF
  BFFFE BFFFF CFFFE CFFFF DFFFE DFFFF EFFFE EFFFF FFFFE FFFFF 10FFFE
  10FFFF

-- 
\\// Peter - http://www.softwolves.pp.se/
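
For illustration only, the per-plane mask and the U+FDD0--U+FDEF range
check could be folded into one helper roughly like this. The name
is_noncharacter() is made up here and is not part of the patch under
review. The sketch assumes the code point has already been decoded and
checked to lie within U+0000..U+10FFFF.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): returns 1 if the code
 * point is a Unicode non-character, 0 otherwise.
 * Assumes codepoint <= 0x10FFFF has been verified by the caller.
 */
static int is_noncharacter(uint32_t codepoint)
{
	/* U+xxFFFE and U+xxFFFF: the last two code points of every plane. */
	if ((codepoint & 0xfffe) == 0xfffe)
		return 1;
	/* U+FDD0..U+FDEF: the contiguous block of 32 non-characters. */
	if (codepoint >= 0xfdd0 && codepoint <= 0xfdef)
		return 1;
	return 0;
}
```

Masking with 0xfffe ignores the plane bits (16 and up) and bit 0, so a
single comparison covers both FFFE and FFFF in all 17 planes.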