Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
To: Marko Myllynen, Rafal Luzynski
Cc: Keld Simonsen, libc-alpha@sourceware.org, libc-locales@sourceware.org,
 "Dmitry V. Levin", Volodymyr Lisivka, Carlos O'Donell, Max Kutny, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>
 <246390048.827062.1539037422672@poczta.nazwa.pl>
 <4db1ce91-3184-cf45-01c5-80667fc4cf65@kobylkin.com>
 <1198370378.413479.1539123456488@poczta.nazwa.pl>
 <70c29e42-0fd3-4f10-fafb-44d67190d870@kobylkin.com>
 <9edcf6f2-607c-91ac-8eaf-ffbc973fe597@redhat.com>
From: Egor Kobylkin
Openpgp: preference=signencrypt
Message-ID: <3f50cc1f-9493-0611-3478-0394ecb6b37e@kobylkin.com>
Date: Wed, 10 Oct 2018 14:19:37 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <9edcf6f2-607c-91ac-8eaf-ffbc973fe597@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

On 10.10.2018 13:22, Marko Myllynen wrote:
>> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303
>
> Although I haven't checked every rule this in general looks very good
> (but see below).
> Not sure do we want to add the few missing characters
> mentioned at https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode,
> e.g., one instantly notices that U+0400 is missing. (I wouldn't add at
> least initially the more exotic characters, like the historic ones,
> though.) Perhaps filing a bug or two for these cases for separate
> consideration would be ok.

The question here is what should serve as their transliteration and
transcription. They are covered neither by ISO 9 nor by GOST 7.79.
So maybe it would be reasonable to assume that they do not occur
anywhere in notable numbers? Anyway, I am happy to include your specific
suggestions for any Unicode characters, given as quartets in this form:

[Cyrillic Unicode ; ISO 9 Latin Transliteration (System A) as Unicode ;
Transcription (System B) as (multicharacter) ASCII ; name to put in %COMMENT ]

>
>> On 10.10.2018 00:40, Egor Kobylkin wrote:
>>> On 10.10.2018 00:17, Rafal Luzynski wrote:
>>>> 9.10.2018 20:34 Egor Kobylkin wrote:
>>>>>
>>>>> The culprits were the "" around the "" () and
>>>>> "" ().
>>>>> It works now with
>>>>> % CYRILLIC UNDEFINED
>>>>> ;""
>>>>> % CYRILLIC UNDEFINED
>>>>> ;""
>>>>>
>>>>> [...]
>>>>
>>>> I wonder why you need Cyrillic U with acute, and why you comment it
>>>> as "undefined" at all. I know that any Cyrillic vowel may appear with
>>>> an acute accent but "the diacritic is used only in dictionaries, children's
>>>> books, resources for foreign-language learners (...)". [1] So maybe
>>>> all vowels with an acute accent should be handled (which I think is fine)
>>>> rather than just U.
>>>
>>> I have just taken the https://en.wikipedia.org/wiki/ISO_9 table and
>>> implemented it on Marko's suggestion. Personally I have no opinion on
>>> what letters should be included and under what name. These funny Us just
>>> happened to be in the ISO9 table.
>>>
>>> There is no codepoint and no name for and
>>> in Unicode. That’s why its coming through that way from my worksheet as
>>> it does a reverse lookup on the names based on the Unicode codepoints.
>>>
>>> Manually we can change it to whatever you’d suggest in the
>>> translit_cyrillic. I just don’t know the right name.
>
> I'm not sure this will work, no existing rule in translit_* files
> contain two characters, I'd assume that the rule for U+0423 is applied
> first and then the below rule is never used.
>
> % CYRILLIC UNDEFINED
> ;""
>
> Perhaps this should be commented out or removed altogether if it's not
> working as intended.
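
For illustration only, a quartet in the form proposed above could be turned into a rule line roughly as follows. This is a minimal Python sketch under stated assumptions, not the worksheet actually used for the patch: the helper names `ucode`, `comment_name` and `translit_rule` are made up here, the rule layout (a `%` comment line, then the source in `<Uxxxx>` notation followed by two quoted alternatives separated by `;`) is modelled on the fragments quoted in this thread, and the System A sample values are examples rather than entries taken from the patch. It also shows why a two-codepoint sequence ends up labelled CYRILLIC UNDEFINED: a reverse lookup by codepoint has no single name to return for it.

```python
import unicodedata


def ucode(text):
    """Spell each code point of text in <Uxxxx> notation."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


def comment_name(cyrillic):
    """Reverse-look-up the Unicode name; a sequence of code points has none."""
    if len(cyrillic) == 1:
        return unicodedata.name(cyrillic, "CYRILLIC UNDEFINED")
    return "CYRILLIC UNDEFINED"


def translit_rule(cyrillic, system_a, system_b):
    """Format one quartet as a comment line plus a rule line (assumed layout)."""
    rule = f'{ucode(cyrillic)} "{ucode(system_a)}";"{ucode(system_b)}"'
    return f"% {comment_name(cyrillic)}\n{rule}"


# Sample quartets; the System A values are illustrative assumptions.
print(translit_rule("\u0401", "\u00cb", "YO"))       # single code point, named
print(translit_rule("\u0423\u0301", "\u00da", "U"))  # two code points, unnamed
```

Keeping the name lookup separate from the formatting makes it easy to substitute a hand-chosen comment once a suitable name for the accented letters is agreed.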
Here is the result of my test on
https://sourceware.org/bugzilla/attachment.cgi?id=11304

U0423 0301-У́ -> U0423 0301-U
U0443 0301-у́ -> U0443 0301-u

So yes, the two-codepoint rules are not processed. I would drop them to
avoid special cases, but I am also fine with keeping them because the
work is already done.

Result:

CYRILLIC RUSSIAN
S``esh` eshhyo e`tih myagkih francuzskih bulok, da vypej zhe chayu.
SA`ESH` ESHHYO E`TIH MYAGKIH FRANCUZSKIH BULOK? DA VYPEJ ZHE CHAYU!

CYRILLIC COMPLETE
U0401-YO U0402-DJ U0403-G` U0404-Ye U0405-Z` U0406-I U0407-Yi U0408-J U0409-L` U040A-N` U040B-TSH U040C-K` U040E-U` U040F-Dh
U0410-A U0411-B U0412-V U0413-G U0414-D U0415-E U0416-ZH U0417-Z U0418-I U0419-J U041A-K U041B-L U041C-M U041D-N U041E-O U041F-P
U0420-R U0421-S U0422-T U0423-U U0423 0301-U U0424-F U0425-H U0426-C U0427-CH U0428-SH U0429-SHH U042A-`` U042B-Y U042C-` U042D-E` U042E-YU U042F-YA
U0430-a U0431-b U0432-v U0433-g U0434-d U0435-e U0436-zh U0437-z U0438-i U0439-j U043A-k U043B-l U043C-m U043D-n U043E-o U043F-p
U0440-r U0441-s U0442-t U0443-u U0443 0301-u U0444-f U0445-h U0446-c U0447-ch U0448-sh U0449-shh U044A-A` U044B-y U044C-` U044D-e` U044E-yu U044F-ya
U0451-yo U0452-dj U0453-g` U0454-ye U0455-z` U0456-i U0457-yi U0458-j U0459-l` U045A-n` U045B-tsh U045C-k` U045E-u` U045F-dh
U046A-O` U046B-o` U0472-Fh U0473-fh U0474-Yh U0475-yh U048C-E` U048D-e`
U0490-G` U0491-g` U0492-GH U0493-gh U0494-GH U0495-gh U0496-ZH` U0497-zh` U049A-K` U049B-k` U049E-K` U049F-k`
U04A2-N` U04A3-n` U04A4-NG U04A5-ng U04A6-P` U04A7-p` U04A8-O` U04A9-o` U04AA-C` U04AB-C` U04AC-T` U04AD-t` U04AE-U U04AF-u
U04B2-H` U04B3-h` U04B4-TCZ U04B5-tcz U04BA-SH` U04BB-SH` U04BC-CH` U04BD-ch` U04BE-CH` U04BF-ch`
U04C0-i U04C1-ZH` U04C2-zh` U04CB-CH` U04CC-ch`
U04D0-A` U04D1-a` U04D2-A` U04D3-a` U04D6-E` U04D7-e` U04D8-A` U04D9-a` U04DC-ZH` U04DD-zh` U04DE-Z` U04DF-z`
U04E0-Z` U04E1-z` U04E4-I` U04E5-i` U04E6-O` U04E7-o` U04E8-O` U04E9-o`
U04F0-U` U04F1-u` U04F2-U` U04F3-u` U04F4-CH` U04F5-ch` U04F8-Y` U04F9-y`
U2019-'

Source:

CYRILLIC RUSSIAN
Съешь ещё этих мягких французских булок, да выпей же чаю.
СЪЕШЬ ЕЩЁ ЭТИХ МЯГКИХ ФРАНЦУЗСКИХ БУЛОК? ДА ВЫПЕЙ ЖЕ ЧАЮ!

CYRILLIC COMPLETE
U0401-Ё U0402-Ђ U0403-Ѓ U0404-Є U0405-Ѕ U0406-І U0407-Ї U0408-Ј U0409-Љ U040A-Њ U040B-Ћ U040C-Ќ U040E-Ў U040F-Џ
U0410-А U0411-Б U0412-В U0413-Г U0414-Д U0415-Е U0416-Ж U0417-З U0418-И U0419-Й U041A-К U041B-Л U041C-М U041D-Н U041E-О U041F-П
U0420-Р U0421-С U0422-Т U0423-У U0423 0301-У́ U0424-Ф U0425-Х U0426-Ц U0427-Ч U0428-Ш U0429-Щ U042A-ъ U042B-Ы U042C-ь U042D-Э U042E-Ю U042F-Я
U0430-а U0431-б U0432-в U0433-г U0434-д U0435-е U0436-ж U0437-з U0438-и U0439-й U043A-к U043B-л U043C-м U043D-н U043E-о U043F-п
U0440-р U0441-с U0442-т U0443-у U0443 0301-у́ U0444-ф U0445-х U0446-ц U0447-ч U0448-ш U0449-щ U044A-Ъ U044B-ы U044C-Ь U044D-э U044E-ю U044F-я
U0451-ё U0452-ђ U0453-ѓ U0454-є U0455-ѕ U0456-і U0457-ї U0458-ј U0459-љ U045A-њ U045B-ћ U045C-ќ U045E-ў U045F-џ
U046A-Ѫ U046B-ѫ U0472-Ѳ U0473-ѳ U0474-Ѵ U0475-ѵ U048C-Ҍ U048D-ҍ
U0490-Ґ U0491-ґ U0492-Ғ U0493-ғ U0494-Ҕ U0495-ҕ U0496-Җ U0497-җ U049A-Қ U049B-қ U049E-Ҟ U049F-ҟ
U04A2-Ң U04A3-ң U04A4-Ҥ U04A5-ҥ U04A6-Ҧ U04A7-ҧ U04A8-Ҩ U04A9-ҩ U04AA-Ҫ U04AB-ҫ U04AC-Ҭ U04AD-ҭ U04AE-Ү U04AF-ү
U04B2-Ҳ U04B3-ҳ U04B4-Ҵ U04B5-ҵ U04BA-Һ U04BB-һ U04BC-Ҽ U04BD-ҽ U04BE-Ҿ U04BF-ҿ
U04C0-Ӏ U04C1-Ӂ U04C2-ӂ U04CB-Ӌ U04CC-ӌ
U04D0-Ӑ U04D1-ӑ U04D2-Ӓ U04D3-ӓ U04D6-Ӗ U04D7-ӗ U04D8-Ә U04D9-ә U04DC-Ӝ U04DD-ӝ U04DE-Ӟ U04DF-ӟ
U04E0-Ӡ U04E1-ӡ U04E4-Ӥ U04E5-ӥ U04E6-Ӧ U04E7-ӧ U04E8-Ө U04E9-ө
U04F0-Ӱ U04F1-ӱ U04F2-Ӳ U04F3-ӳ U04F4-Ӵ U04F5-ӵ U04F8-Ӹ U04F9-ӹ
U2019-’
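
The U0423 0301 and U0443 0301 rows in the test are consistent with a lookup that consumes one code point at a time, as Marko suspected. A minimal Python sketch of that behaviour, and explicitly not glibc's actual implementation: `TABLE`, `translit_per_codepoint`, `translit_longest_match` and the two-codepoint replacement `U'` are hypothetical, and the empty mapping for U+0301 is an assumption made only so that the toy output matches the observed `U`.

```python
# Toy table. The U+0423 and U+0443 values match the test result above;
# everything else is a labelled assumption.
TABLE = {
    "\u0423": "U",
    "\u0443": "u",
    "\u0301": "",            # assumption: the combining acute is dropped
    "\u0423\u0301": "U'",    # hypothetical two-codepoint rule
}


def translit_per_codepoint(text):
    """Look up each code point on its own; sequence keys can never match."""
    return "".join(TABLE.get(ch, ch) for ch in text)


def translit_longest_match(text):
    """Contrast case: try the longest key first, so sequence rules can fire."""
    keys = sorted(TABLE, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(TABLE[key])
                i += len(key)
                break
        else:
            # No rule applies: pass the code point through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


sample = "\u0423\u0301"
print(translit_per_codepoint(sample))   # two-codepoint rule is never consulted
print(translit_longest_match(sample))   # two-codepoint rule is applied
```

Under the per-codepoint model the extra rule is dead weight, which supports dropping it rather than keeping a special case that has no effect.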