Authentication-Results: sourceware.org; auth=none
X-HELO: mail-wm1-f68.google.com
Reply-To: Marko Myllynen
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
To: Egor Kobylkin, Rafal Luzynski, Keld Simonsen
Cc: libc-alpha@sourceware.org, libc-locales@sourceware.org, "Dmitry V. Levin", Volodymyr Lisivka, Carlos O'Donell, Max Kutny, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
From: Marko Myllynen
Message-ID:
Date: Fri, 5 Oct 2018 14:54:10 +0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi,

Would it make sense to first use ISO 9:1995/GOST 7.79 System A where
possible and, if that is not possible, fall back to GOST 7.79 System B?
Implementation-wise, the current translit_* files have a few examples
where a non-ASCII transliteration is tried first before an ASCII
fallback. These examples are from translit_neutral:

% NARROW NO-BREAK SPACE
 ;
% REVERSED TRIPLE PRIME
 "";""

Thanks,

On 2018-10-05 13:29, Egor Kobylkin wrote:
> Keld, Marko, Rafal, other locale maintainers,
>
> this is all written with a minimal viable fix for this bug in mind,
> asap. I want to avoid wasting maintainers' time getting into
> fundamental discussions here (although for perfectly good reasons).
>
> I see three options:
> 1. those locale maintainers that are fine with using the ISO
> 9:1995/GOST_7.79_System_B Cyrillic transliteration table (Ru) include it
> in their locales (see attached screenshot of the table).
> 2. those that want to have a differing table can create their own
> variety based on the spreadsheet I have prepared
> https://sourceware.org/bugzilla/attachment.cgi?id=8590 and include it in
> this patch.
> 3. those that want to omit a Cyrillic transliteration altogether for now
> state so and just carry over the bug #2872 from the year 2006.
>
> Does this make sense to you?
>
> Just to be super clear on this: the patch is a stopgap _ASCII_
> transliteration table. ASCII being the AMERICAN Standard Code for
> Information Interchange, it is obviously orthogonal to any
> transliteration rule of other countries. As such it is not explicitly
> targeting transliteration standards of any country.
>
> The patch reflects the Russian variety of ISO
> 9:1995/GOST_7.79_System_B because a) ISO 9:1995/GOST_7.79_System_B is
> available and can be helpful to a majority of Cyrillic users b) I have
> access to it, including by being proficient in Russian.
>
> It is offered to all the respective locale maintainers as a stopgap
> solution. Stopgap in the sense that it is better to have some
> transliteration than not to have any at all and carry over the bug from
> 2006. That it may be a somewhat officially correct transliteration for
> ru_RU is a bonus. In that sense I would dub the discussion on the
> correctness for other languages "offtopic". Let me know if this is not OK.
>
> You are all correctly mentioning the deficiencies of this approach.
> However, I couldn't find a better straightforward approach as of yet.
> Happy to hear from you on how this could be handled.
>
> There is a danger of being caught in the web of language/country
> differences. I propose just pruning the locales that are not comfortable
> including this current table. We can address possible solutions in the
> second wave of patching.
>
> I am wary of getting into discussions on specific country variants just
> because of the sheer complexity of this topic. It is probably better
> addressed by the respective maintainers of their locales. I do not see a
> "one size fits all" solution as possible in this first wave.
>
> I would like to have this "three options plan of action" vetted first
> and then we could go to the specific detail. (Like, for instance, what
> characters should be included in the table, and in which
> transliteration form.)
>
> I am looking forward to your reply,
> Egor Kobylkin
>
> P.S. specifically as to how to address languages other than Ru included in
> GOST_7.79_System_B: we can take the first option left to right from that
> table (Ru,By,Uk,Bg,Mk). Then it will technically work for all those
> locales/languages but with errors where Ru supersedes their own variants.
>
>
> On 05.10.2018 11:20, Rafal Luzynski wrote:
>> 3.10.2018 11:32 Egor Kobylkin wrote:
>>>
>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>> Hi
>>>>
>>>> Please note that transliteration of Cyrillic to Latin is not universal.
>>>> There are different schemes for, for example, German, English and Danish, and
>>>> there is also an ISO standard for it.
>>>
>>> Thanks for your feedback, Keld!
>>>
>>> Could the locale maintainers that wouldn't like to include this patch
>>> explicitly state so here?
>>
>> I think it is about me so I must reply. I am sorry about that and the sole
>> reason is my lack of time. I'm just a volunteer here, that means it's not
>> my regular job to work on locale data nor anything in glibc nor in any other
>> open source project. I do these things only in my free time, which I don't
>> have much of. Of course you will see my contributions here and there but they
>> are either trivial or take me months to complete. Your patches are on my
>> radar but I can't give any ETA for them. Of course, there are other people
>> around here and they are all welcome to come and join.
>>
>>> That is:
>>> - In the case that there is a different preferred Cyrillic
>>> transliteration table for any specific locale their maintainers may want
>>> to point me to it so I can supply a separate table/patch.
>>> - Or they could state explicitly that for some reason they would like to
>>> exclude their locale from the patch for a default Cyrillic
>>> transliteration altogether.
>>
>> As Keld wrote, there are probably separate rules for every language so
>> I don't think you should treat your rules as universal and include them
>> in every locale. At first sight, it seems to me they work only for English
>> (as a destination locale). Also, although it is called "transliteration
>> from Cyrillic" it seems that it covers only the Russian alphabet. What about
>> other languages which use the Cyrillic alphabet but add their own diacritic
>> characters? Think about Belarusian, Ukrainian, Serbian, Chechen, Chuvash,
>> Mari, Ossetian, Yakut, Tatar, and more. What about languages which use the
>> Cyrillic alphabet but transliterate their respective letters in a different
>> way than Russian? For example, Russian "Ъ" is (I think) usually skipped
>> in transliteration, I think you propose "``", but when transliterating from
>> Bulgarian they usually transliterate this as "ă".
>>
>> A few remarks:
>>
>> * I think you transliterate "щ" as "shh", wouldn't "shch" be better?
>> * You transliterate "ц" as "cz", wouldn't "ts" be better? By the way,
>> in the Polish language "cz" is a correct transliteration of "ч".
>> * You transliterate "й" as "j", this is fine in many languages but wouldn't
>> "y" be better in English?
>> * In case of "е": how will you know if it is correct to transliterate it
>> to "e" or "ie" or "je" or "ye"?
>>
>> These remarks are obviously incomplete, your patch deserves much more
>> attention to review.
>>
>> Best regards,
>>
>> Rafal
>>
>

-- 
Marko Myllynen
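
P.S. A minimal sketch of the "System A first, System B as fallback" idea from the top of this message, written in Python rather than in locale-file syntax. It only illustrates the ordering logic: the four-letter table, the function names and the "?" placeholder are assumptions made for this example, not data from the patch and not how glibc implements it. Of the values shown, only "щ" -> "shh" is taken from this thread.

```python
# Illustrative subset only (assumption): each letter lists its candidates
# in order of preference, the System A form with a diacritic first and
# the ASCII-only System B form second.
CANDIDATES = {
    "ж": ["ž", "zh"],
    "ч": ["č", "ch"],
    "ш": ["š", "sh"],
    "щ": ["ŝ", "shh"],
}


def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, charset):
    """Replace characters missing from charset by the first candidate
    that the charset can represent; "?" if none fits (hypothetical)."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)  # already representable, keep as-is
            continue
        for candidate in CANDIDATES.get(ch, []):
            if encodable(candidate, charset):
                out.append(candidate)
                break
        else:
            out.append("?")  # no usable candidate in the table
    return "".join(out)
```

With a Latin-2 target, transliterate("жчш", "iso8859-2") can keep the System A forms, while transliterate("щ", "ascii") has to drop to the System B form because "ŝ" is not available in ASCII. The point of keeping both forms in one ordered list is that a single table can serve both kinds of target charset.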