From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
To: Marko Myllynen, Rafal Luzynski, libc-alpha@sourceware.org, libc-locales@sourceware.org
Cc: mfabian@redhat.com, "Dmitry V. Levin", Volodymyr Lisivka, Max Kutny, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <165238610.582597.1539392357757@poczta.nazwa.pl> <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>
From: Egor Kobylkin
Message-ID: <8206061b-4e9c-b366-85a4-93ef61687ca0@kobylkin.com>
Date: Mon, 15 Oct 2018 13:54:53 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>
Content-Type: multipart/mixed; boundary="------------854CBCEEDB262A895610EB51"

This is a multi-part message in MIME format.
--------------854CBCEEDB262A895610EB51
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

On 15.10.2018 13:04, Marko Myllynen wrote:
> Hi,
>
> On 2018-10-13 19:58, Egor Kobylkin wrote:
>> On 13.10.2018 02:59, Rafal Luzynski wrote:
>>
>>> Regarding the tests, I think there is no complete transliteration
>>> test suite at the moment. Probably the only test is
>>> localedata/bug-iconv-trans.c. You can also see the collation tests
>>> placed in the same directory, they use those multiple *.UTF-8.in
>>> files.
>>>
>>> You can skip the tests for now.
>>
>> First I thought they could just be added but not all locales
>> transliterate Umlauts so just extending the current test won't do as it
>> will fail for those locales.
>
> I still think a one-time check against uconv(1) (part of Unicode's ICU
> project) for discrepancies would be useful.

Just an addition. I have changed a few constants to see whether
localedata/bug-iconv-trans.c could be made to test Cyrillic.
Attached is bug-iconv-trans-cyr.c, which passes in this form. I had to
save it as UTF-8 instead of the ISO-8859-15 used for
localedata/bug-iconv-trans.c.

>>>> [...]
>>>> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
>>>> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
>>>> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
>>>> @@ -1394,6 +1394,7 @@
>>>>
>>>> +include "translit_cyrillic";""
>>>>  translit_end
>>>>  % END LC_CTYPE
>>>
>>> Shouldn't “include "translit_cyrillic";""” be placed before the
>>> custom rules, together with other includes? The same in more files,
>>> I will not mention them all.
>>
>> If I recall correctly it is because of the
>> "translit_end
>> END LC_CTYPE"
>> part at the end of the translit_cyrillic. This way it works for any
>> locale, regardless of whether it has translit itself or not. And being
>> at the end it does not supersede any previous transliteration that may
>> be there for a reason.
>
> I suspect one problem would be that the latter rule wins, so if there
> are some locale-specific rules, then possible translit_* inclusions
> would override them if not included before the locale-specific rules.

What is the best way forward here? Can somebody make an explicit
suggestion on how to change the current approach, if a change is needed?
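
To see which rule actually wins for a particular locale, a single
character can be transliterated under that locale and the output
inspected. The following is only a minimal sketch of such a check,
assuming glibc's iconv with the //TRANSLIT suffix. The names
translit_under_locale and TRANSLIT_DEMO_MAIN are made up for
illustration and are not part of glibc or of the patch. The result
depends entirely on the translit tables of the installed locale, so no
expected output is given.

```c
#include <assert.h>
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of glibc or of the patch: transliterate
   the UTF-8 string IN to ASCII under LOCALE_NAME and store the
   NUL-terminated result in OUT (OUTSIZE bytes).  Returns 0 on success
   and -1 on any failure, including an output buffer that is too small.  */
static int
translit_under_locale (const char *locale_name, const char *in,
                       char *out, size_t outsize)
{
  assert (locale_name != NULL && in != NULL && out != NULL);

  /* The transliteration rules are taken from the LC_CTYPE data of the
     locale that is active when the conversion runs.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char *inptr = (char *) in;
  size_t inlen = strlen (in) + 1;   /* Convert the terminating NUL too.  */
  char *outptr = out;
  size_t outlen = outsize;

  size_t n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  iconv_close (cd);

  return (n == (size_t) -1 || inlen != 0) ? -1 : 0;
}

#ifdef TRANSLIT_DEMO_MAIN
/* Usage: ./a.out LOCALE UTF8-STRING  */
int
main (int argc, char **argv)
{
  char out[256];

  if (argc != 3
      || translit_under_locale (argv[1], argv[2], out, sizeof (out)) != 0)
    {
      puts ("transliteration failed");
      return 1;
    }
  printf ("%s -> %s\n", argv[2], out);
  return 0;
}
#endif
```

Built with -DTRANSLIT_DEMO_MAIN, this can be run once per locale with a
character that has a locale-specific rule, before and after moving the
include line, to compare which rule takes effect.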
Bests,
Egor

--------------854CBCEEDB262A895610EB51
Content-Type: text/x-csrc; name="bug-iconv-trans-cyr.c"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="bug-iconv-trans-cyr.c"

#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  iconv_t cd;
  /* UTF-8 input: an ASCII prefix followed by the Cyrillic letters
     covered by the transliteration table.  */
  const char str[] = "CyrillicLetters_ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’";
  /* ASCII output expected from the transliteration.  */
  const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
  char *inptr = (char *) str;
  size_t inlen = strlen (str) + 1;
  char outbuf[500];
  char *outptr = outbuf;
  size_t outlen = sizeof (outbuf);
  int result = 0;
  size_t n;

  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
    {
      puts ("setlocale failed");
      return 1;
    }

  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      puts ("iconv_open failed");
      return 1;
    }

  /* iconv returns the number of irreversible conversions performed.  */
  n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  if (n != 174)
    {
      if (n == (size_t) -1)
        printf ("iconv() returned error: %m\n");
      else
        printf ("iconv() returned %zu, expected 174\n", n);
      result = 1;
    }

  /* The whole input, including the terminating NUL, must be consumed.  */
  if (inlen != 0)
    {
      puts ("not all input consumed");
      result = 1;
    }
  else if (inptr - str != strlen (str) + 1)
    {
      printf ("inptr wrong, advanced by %td\n", inptr - str);
      result = 1;
    }

  if (memcmp (outbuf, expected, sizeof (expected)) != 0)
    {
      printf ("result wrong: \"%.*s\", expected: \"%s\"\n",
              (int) (sizeof (outbuf) - outlen), outbuf, expected);
      result = 1;
    }
  else if (outlen != sizeof (outbuf) - sizeof (expected))
    {
      printf ("outlen wrong: %zu, expected %zu\n", outlen,
              sizeof (outbuf) - sizeof (expected));
      result = 1;
    }
  else
    printf ("output is \"%s\" which is OK\n", outbuf);

  return result;
}
--------------854CBCEEDB262A895610EB51--