Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
To: Marko Myllynen, Rafal Luzynski, Keld Simonsen
Cc: libc-alpha@sourceware.org, libc-locales@sourceware.org, "Dmitry V. Levin", Volodymyr Lisivka, Carlos O'Donell, Max Kutny, danilo@gnome.org
From: Egor Kobylkin
Date: Fri, 5 Oct 2018 22:47:09 +0200

After some kind help from Marko in an offline discussion, I realized
that the multi/single character approach I originally took went against
the iconv(1) logic anyway. So there is no harm in dropping it and
adopting Marko's suggestion instead. I will do so and resubmit the
patch with ISO 9:1995/GOST 7.79 System A plus a fallback to GOST 7.79
System B (for ASCII).

However, this doesn't resolve the issue of the ASCII part being
different for various locales. Again, I invite the locale maintainers
to let me know whether they want to 1) adopt the table I am supplying,
2) write their own, or 3) ignore the patch altogether. Your feedback is
appreciated!

This is the relevant part that helped:

> The first part (ISO-8859-15 or ASCII) defines the target encoding for
> iconv(1). //TRANSLIT is described in the iconv(1) man page as:
>
>     If the string //TRANSLIT is appended to to-encoding, characters
>     being converted are transliterated when needed and possible.
>     This means that when a character cannot be represented in the
>     target character set, it can be approximated through one or
>     several similar looking characters. Characters that are outside
>     of the target character set and cannot be transliterated are
>     replaced with a question mark (?) in the output.
>
> So in the above examples, iconv(1) encounters the character U+0428,
> which is not part of either target encoding. Since //TRANSLIT is
> specified, iconv(1) tries transliteration according to the rules
> defined above; in the case of ASCII, U+0160 is not part of the target
> encoding, so the next alternative is used.

Bests,
Egor Kobylkin

On 05.10.2018 14:21, Marko Myllynen wrote:
> Hi,
>
> The scheme I proposed would also be ASCII compatible; consider this
> example:
>
> % CYRILLIC CAPITAL LETTER SHA
> <U0428> "<U0160>";"<U0053><U0068>"
>
> "printf \\u0428\\n | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv
> -f ISO-8859-15 -t UTF-8" would produce Š as per System A and "printf
> \\u0428\\n | iconv -f UTF-8 -t ASCII//TRANSLIT" would produce Sh as
> per System B.
>
> Thanks,
>
> On 2018-10-05 15:00, Egor Kobylkin wrote:
>> Hi Marko,
>>
>> I have chosen System B because it is ASCII compatible. System
>> A is not ASCII compatible (diacritics in target).
>>
>> https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
>>
>> "GOST 7.79 contains two transliteration tables.
>>
>> System A one Cyrillic character to one Latin character, some with
>> diacritics – identical to ISO 9:1995
>>
>> System B one Cyrillic character to one or many Latin characters
>> without diacritics "
>>
>> Hope this helps, Egor
>>
>> On 05.10.2018 13:54, Marko Myllynen wrote:
>>> Hi,
>>>
>>> Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
>>> possible and if not, then fall back to GOST 7.79 System B?
>>>
>>> Implementation-wise, the current translit_* files have a few
>>> examples where a non-ASCII transliteration is tried first before
>>> an ASCII fallback.
>>> These examples are from translit_neutral:
>>>
>>> % NARROW NO-BREAK SPACE
>>> ;
>>> % REVERSED TRIPLE PRIME
>>> "";""
>>>
>>> Thanks,
>>>
>>> On 2018-10-05 13:29, Egor Kobylkin wrote:
>>>> Keld, Marko, Rafal, other locale maintainers,
>>>>
>>>> all of this is written with a minimal viable fix for this bug
>>>> in mind, as soon as possible. I want to avoid wasting the
>>>> maintainers' time on fundamental discussions here (even though
>>>> they would be held for perfectly good reasons).
>>>>
>>>> I see three options:
>>>> 1. Those locale maintainers that are fine with using the ISO
>>>>    9:1995/GOST_7.79_System_B Cyrillic transliteration table (Ru)
>>>>    include it in their locales (see attached screenshot of the
>>>>    table).
>>>> 2. Those that want to have a differing table can create their
>>>>    own variety based on the spreadsheet I have prepared
>>>>    https://sourceware.org/bugzilla/attachment.cgi?id=8590 and
>>>>    include it in this patch.
>>>> 3. Those that want to omit a Cyrillic transliteration altogether
>>>>    for now state so and just carry over bug #2872 from the
>>>>    year 2006.
>>>>
>>>> Does this make sense to you?
>>>>
>>>> Just to be super clear on this: the patch is a stopgap _ASCII_
>>>> transliteration table. ASCII being the AMERICAN Standard Code for
>>>> Information Interchange, it is obviously orthogonal to any
>>>> transliteration rule of other countries. As such it is not
>>>> explicitly targeting the transliteration standards of any country.
>>>>
>>>> The patch reflects the Russian variety of ISO
>>>> 9:1995/GOST_7.79_System_B because a) ISO
>>>> 9:1995/GOST_7.79_System_B is available and can be helpful to a
>>>> majority of Cyrillic users and b) I have access to it, including
>>>> by being proficient in Russian.
>>>>
>>>> It is offered to all the respective locale maintainers as a
>>>> stopgap solution. Stopgap in the sense that it is better to
>>>> have some transliteration than not to have any at all and
>>>> carry over the bug from 2006.
>>>> That it may be a somewhat
>>>> officially correct transliteration for ru_RU is a bonus. In
>>>> that sense I would dub the discussion on the correctness for
>>>> other languages "offtopic". Let me know if this is not OK.
>>>>
>>>> You are all correctly mentioning the deficiencies of this
>>>> approach. However, I couldn't find a better straightforward
>>>> approach as of yet. Happy to hear from you on how this
>>>> could be handled.
>>>>
>>>> There is a danger of being caught in the web of
>>>> language/country differences. I propose just pruning the
>>>> locales that are not comfortable including this current table.
>>>> We can address possible solutions in the second wave of
>>>> patching.
>>>>
>>>> I am wary of getting into discussions on specific country
>>>> variants just because of the sheer complexity of this topic.
>>>> It is probably better addressed by the respective maintainers
>>>> of their locales. I do not see a "one size fits all" solution
>>>> as possible in this first wave.
>>>>
>>>> I would like to have this "three options plan of action"
>>>> vetted first and then we could go to the specific detail.
>>>> (Like, for instance, what characters should be included in
>>>> the table, and in which transliteration form.)
>>>>
>>>> I am looking forward to your reply, Egor Kobylkin
>>>>
>>>> P.S. Specifically as to how to address languages other than Ru
>>>> included in GOST_7.79_System_B: we can take the first option
>>>> left to right from that table (Ru,By,Uk,Bg,Mk). Then it will
>>>> technically work for all those locales/languages but with
>>>> errors where Ru supersedes their own variants.
>>>>
>>>>
>>>> On 05.10.2018 11:20, Rafal Luzynski wrote:
>>>>> 3.10.2018 11:32 Egor Kobylkin wrote:
>>>>>>
>>>>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Please note that transliteration of Cyrillic to Latin
>>>>>>> is not universal.
>>>>>>> There are different schemes for German, English and
>>>>>>> Danish, for example, and there is also an
>>>>>>> ISO standard for it.
>>>>>>
>>>>>> Thanks for your feedback, Keld!
>>>>>>
>>>>>> Could the locale maintainers that wouldn't like to include
>>>>>> this patch explicitly state so here?
>>>>>
>>>>> I think it is about me so I must reply. I am sorry about
>>>>> that and the sole reason is my lack of time. I'm just a
>>>>> volunteer here, that means it's not my regular job to work
>>>>> on locale data nor anything in glibc nor in any other open
>>>>> source project. I do these things only in my free time,
>>>>> of which I don't have much. Of course you will see my
>>>>> contributions here and there but they are either trivial or
>>>>> take me months to complete. Your patches are on my radar but
>>>>> I can't give any ETA for them. Of course, there are other
>>>>> people around here and they are all welcome to come and
>>>>> join.
>>>>>
>>>>>> That is:
>>>>>> - In the case that there is a different preferred
>>>>>>   Cyrillic transliteration table for any specific locale
>>>>>>   their maintainers may want to point me to it so I can
>>>>>>   supply a separate table/patch.
>>>>>> - Or they could state explicitly that for some reason they
>>>>>>   would like to exclude their locale from the patch for a
>>>>>>   default Cyrillic transliteration altogether.
>>>>>
>>>>> As Keld wrote, there are probably separate rules for every
>>>>> language so I don't think you should treat your rules as
>>>>> universal and include them in every locale. At first sight,
>>>>> it seems to me they work only for English (as a destination
>>>>> locale). Also, although it is called "transliteration from
>>>>> Cyrillic" it seems that it covers only the Russian alphabet.
>>>>> What about other languages which use the Cyrillic alphabet
>>>>> but add their own diacritic characters? Think about Belarusian,
>>>>> Ukrainian, Serbian, Chechen, Chuvash, Mari, Ossetian, Yakut,
>>>>> Tatar, and more.
>>>>> What about languages which use the Cyrillic
>>>>> alphabet but transliterate their respective letters in a
>>>>> different way than Russian? For example, Russian "Ъ" is (I
>>>>> think) usually skipped in transliteration, I think you
>>>>> propose "``", but when transliterating from Bulgarian they
>>>>> usually transliterate this as "ă".
>>>>>
>>>>> A few remarks:
>>>>>
>>>>> * I think you transliterate "щ" as "shh", wouldn't "shch" be
>>>>>   better?
>>>>> * You transliterate "ц" as "cz", wouldn't "ts" be
>>>>>   better? By the way, in the Polish language "cz" is a correct
>>>>>   transliteration of "ч".
>>>>> * You transliterate "й" as "j", this
>>>>>   is fine in many languages but wouldn't "y" be better in
>>>>>   English?
>>>>> * In case of "е": how will you know if it is
>>>>>   correct to transliterate it to "e" or "ie" or "je" or "ye"?
>>>>>
>>>>> These remarks are obviously incomplete, your patch deserves
>>>>> much more attention to review.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Rafal
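
For illustration, the "System A first, System B as ASCII fallback"
resolution discussed in this thread can be sketched roughly as below.
This is only a model of the idea, not glibc's implementation: RULES,
representable() and translit() are hypothetical names made up for the
example, the table holds just four letters, and real iconv(1) behaviour
also depends on the glibc version and the active locale.

```python
# Illustrative sketch only: this is NOT glibc's implementation.
# RULES, representable() and translit() are hypothetical names.

# Each source character maps to an ordered list of candidates:
# the System A form first, then the System B (pure ASCII) form.
RULES = {
    "\u0428": ["\u0160", "Sh"],  # Ш: Š (System A), then Sh (System B)
    "\u0448": ["\u0161", "sh"],  # ш: š, then sh
    "\u0416": ["\u017d", "Zh"],  # Ж: Ž, then Zh
    "\u0436": ["\u017e", "zh"],  # ж: ž, then zh
}


def representable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit(text, charset):
    """Convert text for charset, trying each rule alternative in order."""
    out = []
    for ch in text:
        if representable(ch, charset):
            out.append(ch)  # already in the target set, keep as-is
            continue
        for candidate in RULES.get(ch, []):
            if representable(candidate, charset):
                out.append(candidate)  # first alternative that fits wins
                break
        else:
            out.append("?")  # no rule, or no alternative fits the target
    return "".join(out)


print(translit("\u0428", "iso-8859-15"))  # System A form is representable
print(translit("\u0428", "ascii"))        # falls through to System B
```

With this table the first call yields Š and the second yields Sh, which
is the behaviour Marko describes for ISO-8859-15 and ASCII targets. A
character with no rule at all comes out as "?", mirroring the man page
text quoted above.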