From: Bruno Haible
To: bug-gnulib@gnu.org, Jose Kahan
Subject: Re: Question about C sscanf and unicode
Date: Mon, 14 Sep 2020 21:45:41 +0200
Message-ID: <2330305.RJ93V7QO3L@omega>
In-Reply-To: <20200914175847.GA13143@w3.org>
References: <20200914175847.GA13143@w3.org>
List-Id: Gnulib discussion list

Jose Kahan wrote:
> I am a contributor to a mail hypertext archiving system called
> hypermail [1], which is written in C.
>
> Recently a bug was raised that one of its parsers had problems
> when the input string had a nbsp.
> As you may imagine from my subject,
> this is because that parser uses sscanf and the nbsp corresponds
> to UTF-8 U+00A0 character:
>
>   urlscan = sscanf(inputp, "%255[^] )<>\"\'\n[\t\\]", urlbuff);
>
> Do you know if there's an sscanf function that is UTF-8 aware?

According to POSIX [1], the %l[ directive parses multibyte characters.
If you set the locale to a UTF-8 locale - such as through
  setlocale (LC_CTYPE, "en_US.UTF-8");
- you should be able to achieve this, at least on glibc systems.

However, this is complex code, and I doubt all platforms get this right.
We found 20 bugs in *printf implementations on various platforms. I wouldn't
be surprised if there were 10 bugs in *scanf implementations, and this part
is among the hairiest in sscanf.

> - Convert the input string to wchar and use swscanf instead.

This is what I would suggest, because
  - swscanf is portable enough [2].
  - Parsing sequences of wide characters in a wide-character string is
    more likely to be correctly implemented everywhere.

> seeing if we can replace the sscanf eventually by regexps.

Does anyone have experience with this?

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/fscanf.html
[2] https://www.gnu.org/software/gnulib/manual/html_node/swscanf.html
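
A minimal sketch of the "convert to wchar, then swscanf" route suggested above. It is not hypermail's code: the names to_wide, scan_url_wide and URL_MAX are hypothetical, and adding U+00A0 to the excluded set is an assumption about what the fix needs. It also assumes the POSIX behaviour of mbstowcs with a null destination (returning the required length) and a C99 compiler that accepts \u escapes in wide string literals.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define URL_MAX 255   /* same field width as the original %255[ directive */

/* Hypothetical helper: convert a multibyte string, interpreted in the
   encoding of the current LC_CTYPE locale, to a newly allocated wide
   string.  Returns NULL on an invalid sequence or allocation failure.
   The caller frees the result. */
wchar_t *to_wide(const char *s)
{
    assert(s != NULL);
    size_t n = mbstowcs(NULL, s, 0);       /* POSIX: length without the NUL */
    if (n == (size_t) -1)
        return NULL;
    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    mbstowcs(w, s, n + 1);
    return w;
}

/* Hypothetical helper: copy the leading run of URL characters from a wide
   string into out, which must hold URL_MAX + 1 wide characters.  The
   excluded set is the one from the original sscanf call, plus U+00A0
   (no-break space), which is an assumption of this sketch.
   Returns 1 if at least one character was copied, 0 otherwise. */
int scan_url_wide(const wchar_t *input, wchar_t out[URL_MAX + 1])
{
    assert(input != NULL && out != NULL);
    out[0] = L'\0';
    /* %l[ stores wide characters; ']' comes first so it is part of the set. */
    int n = swscanf(input, L"%255l[^] \u00A0)<>\"'\n[\t\\]", out);
    return n == 1;
}
```

The caller would run setlocale (LC_CTYPE, "") or pick a UTF-8 locale once at startup, pass the input through to_wide, and hand the result to scan_url_wide. Since the matching then happens on wide characters, the nbsp is a single element of the set rather than a two-byte sequence.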