From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.0 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,HK_RANDOM_FROM,HK_RANDOM_REPLYTO, MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 78B4D1F4B4 for ; Mon, 14 Sep 2020 18:06:25 +0000 (UTC) Received: from localhost ([::1]:55566 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kHsrw-00041e-7t for normalperson@yhbt.net; Mon, 14 Sep 2020 14:06:24 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:54302) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kHskp-0002Nn-Ls for bug-gnulib@gnu.org; Mon, 14 Sep 2020 13:59:03 -0400 Received: from raoul.w3.org ([128.30.52.128]:43634) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kHskn-0002i6-Q9 for bug-gnulib@gnu.org; Mon, 14 Sep 2020 13:59:03 -0400 Received: from mut38-7-78-226-234-114.fbx.proxad.net ([78.226.234.114] helo=kiribati.inrialpes.fr) by raoul.w3.org with esmtpsa (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1kHskc-0000ZK-59 for bug-gnulib@gnu.org; Mon, 14 Sep 2020 17:58:50 +0000 Received: by kiribati.inrialpes.fr (Postfix, from userid 50104) id 751BC638; Mon, 14 Sep 2020 19:58:47 +0200 (CEST) Date: Mon, 14 Sep 2020 19:58:47 +0200 From: jkjdll@w3.org To: bug-gnulib@gnu.org Subject: Question about C sscanf and unicode Message-ID: <20200914175847.GA13143@w3.org> MIME-Version: 1.0 X-Delivery: dynagel Content-Type: text/plain; charset=iso-8859-15 
Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.10.1 (2018-07-13) Received-SPF: pass client-ip=128.30.52.128; envelope-from=kahan@w3.org; helo=raoul.w3.org X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/14 13:58:50 X-ACL-Warn: Detected OS = Linux 3.11 and newer X-Spam_score_int: -59 X-Spam_score: -6.0 X-Spam_bar: ------ X-Spam_report: (-6.0 / 5.0 requ) BAYES_00=-1.9, HK_RANDOM_FROM=0.892, HK_RANDOM_REPLYTO=0.001, RCVD_IN_DNSWL_HI=-5, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Mailman-Approved-At: Mon, 14 Sep 2020 14:06:21 -0400 X-BeenThere: bug-gnulib@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Gnulib discussion list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: jkjdll@w3.org Errors-To: bug-gnulib-bounces+normalperson=yhbt.net@gnu.org Sender: "bug-gnulib"

Hi,

I am a contributor to a mail hypertext archiving system called hypermail [1], which is written in C. Recently a bug was reported: one of its parsers has problems when the input string contains an nbsp. As you may imagine from my subject, this is because that parser uses sscanf, and the nbsp is the character U+00A0, which UTF-8 encodes as the two bytes 0xC2 0xA0:

    urlscan = sscanf(inputp, "%255[^] )<>\"\'\n[\t\\]", urlbuff);

Do you know if there's an sscanf function that is UTF-8 aware? Or, if none exists, an alternative way to solve this issue?

For the moment I see two possibilities:

- As the code is already using PCRE, temporarily replace all space characters with a plain 0x20, while seeing whether we can eventually replace the sscanf with regexps.
- Convert the input string to wchar_t and use swscanf instead.

If you have any input on the above, I'd appreciate it very much.

Thank you in advance,

--josé

[1] https://github.com/hypermail-project/hypermail
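
For illustration, here is a minimal sketch of the first possibility, under some loud assumptions: it uses a plain byte loop rather than PCRE, it only handles U+00A0 (not other Unicode space characters), and the names nbsp_to_space and scan_url as well as the 1024-byte scratch buffer are hypothetical, not taken from hypermail. It copies the input, turns each 0xC2 0xA0 pair into a single 0x20, and then runs the same sscanf scanset on the copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of hypermail): copy src into dst, replacing
 * each UTF-8 encoded U+00A0 (bytes 0xC2 0xA0) with a plain 0x20 space.
 * The copy is truncated to fit dst_size and is always NUL-terminated. */
static void nbsp_to_space(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    assert(dst_size > 0);
    while (*src != '\0' && out + 1 < dst_size) {
        if ((unsigned char)src[0] == 0xC2 && (unsigned char)src[1] == 0xA0) {
            dst[out++] = ' ';   /* two input bytes become one space */
            src += 2;
        } else {
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
}

/* Hypothetical wrapper: run the scanset from the original parser on a
 * normalized copy of the input. urlbuff must hold at least 256 bytes.
 * Returns the sscanf result (1 when a token was stored in urlbuff). */
static int scan_url(const char *inputp, char *urlbuff)
{
    char scratch[1024];   /* assumed upper bound, for the sketch only */

    nbsp_to_space(inputp, scratch, sizeof scratch);
    return sscanf(scratch, "%255[^] )<>\"\'\n[\t\\]", urlbuff);
}
```

Calling scan_url on an input where a URL is directly followed by an nbsp should then stop at the nbsp, just as it would at an ordinary space. Because the copy can be shorter than the original, any code that maps offsets in the scanned text back to inputp would need to account for that.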