From: Volodymyr Lisivka
Date: Thu, 11 Oct 2018 16:50:46 +0300
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2
To: digitalfreak@lingonborough.com
Cc: Egor Kobylkin, libc-alpha@sourceware.org, libc-locales@sourceware.org,
 mfabian@redhat.com, myllynen@redhat.com, ldv@altlinux.org, Max Kutny,
 danilo@gnome.org
In-Reply-To: <180516689.458569.1539255868196@poczta.nazwa.pl>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <180516689.458569.1539255868196@poczta.nazwa.pl>

Thu, 11 Oct 2018 at 14:05, Rafal Luzynski wrote:
>
> Thank you, Egor. I am looking at your patch and although I have
> not yet finished, here are some remarks:
>
> First of all, I think that such a large patch should also include
> the tests. Please see how automatic tests are performed in locale
> data and write your own.
>
> 11.10.2018 00:29 Egor Kobylkin wrote:
> > [...]
> > From this patch I have excluded locales that already mention cyrillic or
> > have a transliteration table for it:
> > az_AZ
> > iso14651_t1_common
> > ky_KG
> > mn_MN
> > sr_RS
> > tg_TJ
> > tk_TM
> > tt_RU
> > uk_UA
> > uz_UZ
> > uz_UZ@cyrillic
> > [...]
>
> I think that eventually we would like to include your translit_cyrillic
> also in these locales because I assume that your rules should work good
> for them as well, also should include more characters than the individual
> language contributors took into account.

It's a very good idea. The transliteration in the Ukrainian locale
predates this work by about a decade. It is well tested. I also have
automatic test cases, which I can adapt to the current standard. Let's
drop the Russian transliteration rules and replace them with the
Ukrainian transliteration rules. I assume that the Ukrainian rules
should work well for them too. Ukrainian is the oldest and most
developed language in the Slavic family: the last king of all Slavs,
named Madzhak/Muzhik (Brave), leader of the Volynian union, lived in
Western Ukraine, in the Volyn` region. After the capture of Madzhak,
the kingdom was split into multiple western parts and an eastern part,
where 9 Slavic tribes were united by the Rus` tribe, which had abandoned
its city, now known as Old Russa, because of an epidemic. IMHO, it would
be fair to use the rules of the oldest Slavic union.

> Similarly to Mike's work on
> collation: a common rules were created and all locales include them adding
> their own language specific modifications.

It's a good idea too. In our own locale we prefer that words in our
language appear at the top of a sorted list. Currently, the Ukrainian
locale works as intended, but the Russian locale has the inverted order.
IMHO, the Russian locale should use the Ukrainian rules.

$ echo 'один два three four'| tr ' ' '\n' | LANG=uk_UA.utf8 sort
два
один
four
three
$ echo 'один два three four'| tr ' ' '\n' | LANG=ru_RU.utf8 sort
four
three
два
один

>
> > [...]
> > COMMIT MESSAGE:
> > [...]
> > I am excluding these locales from this proposed patch. I have written
> > directly to locale maintainer emails listed in the files. Volodymyr
> > Lisivka , Max Kutny (uk_UA),
> > Данило Шеган (sr_YU, sr_CS) have confirmed the
>
> I am not sure if we want Cyrillic text in the commit message. Shouldn't
> it be, uhm, tranlisterated? :-)
>
> "sr_CS" - I guess you meant "sr_RS".
>
> "sr_YU" has been dropped, do we want to mention it?
>
> > [...]
> > [BZ #2872]
> > * localedata/locales/translit_cyrillic: add ISO 9.1995, GOST 7.79
>
> Please start "Add" with an uppercase. BTW, shouldn't it be "New file"
> instead?
>
> > System A transliteration System B transcription table from Cyrillic to
> > Latin/ASCII.
> > * localedata/locales/C: add include "translit_cyrillic";"" to LC_CTYPE
> > translit section.
>
> Same, "Add" here.
>
> > * localedata/locales/aa_DJ: Likewise.
>
> Good (here and everywhere below).
>
> > [...]
> > diff -uNr a/localedata/locales/translit_cyrillic
> > b/localedata/locales/translit_cyrillic
> > --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
> > +0000
> > +++ b/localedata/locales/translit_cyrillic 2018-10-09 19:02:54.000000000
> > +0000
> > @@ -0,0 +1,383 @@
> > +escape_char /
> > +comment_char %
> > +
> > +% This file is part of the GNU C Library and contains locale data.
> > +% The Free Software Foundation does not claim any copyright interest
> > +% in the locale data contained in this file. The foregoing does not
> > +% affect the license of the GNU C Library as a whole. It does not
> > +% exempt you from the conditions of the license if your use would
> > +% otherwise be governed by that license.
> > +
> > +% Transliterations of cyrillic letters to latin and/or ascii symbols.
>
> "cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".
>
> > +% Inspired by ISO 9.1995 / GOST 7.79-2000.
> > +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
> > +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995
>
> Typos:
>
> "i.e" -> "i.e.," (somebody please fix me if I'm wrong here)
> "U4001" - I guess you meant "U0401"
> "U4F9" -> "U04F9". I think that "U4F9" is not definitely bad but
> let's be consistent.
>
> Also I can see some gaps in the range. Are you going to fill them
> or maybe for now just mention that they exist?
>
> > +% It implements the GOST_7.79 System A (Latin Script) as a first
> > +% option and System B Cyrillic (ASCII) as a second option. Check
> > +% https://en.wikipedia.org/wiki/ISO_9 for reference.
> > +% The System B is extended from GOST_7.79-Russian using open sources
> > +% of the transliteration mappings and the "h/`" diacritics logic.
>
> What is "h/`" diacritics logic?
>
> > +
> > +% Usage examples:
> > +% iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
> > +% | iconv -f ISO-8859-15 -t UTF-8 # System A
> > +% iconv -f UTF-8 -t ASCII//TRANSLIT # System B.
> > +
> > +% Contributions welcome for the rest of Cyrillic script in Unicode
>
> Sure, I'm not going to stop you from pushing these changes just because
> there are missing characters. I will consider adding them later.
>
> > +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
> > +% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
> > +% Generated from UnicodeData.txt with
> > +% https://sourceware.org/bugzilla/attachment.cgi?id=11301.
>
> 1. Is the file really generated with a script and not modified later?
>    If yes then maybe you should contribute the script instead? In that case,
>    you should also not post this file to libc-locale, maintainers and
>    developers should be able to regenerate it.
> 2. The link leads to a LibreOffice spreadsheet.
>
> > +LC_CTYPE
> > +
> > +translit_start
> > +
>
> is missing here. Are you going to leave it for now?
>
> > +% CYRILLIC CAPITAL LETTER IO
> > + ;""
> > [...]
> > +% CYRILLIC CAPITAL LETTER KJE
> > + ;""
>
> is missing here. Can we add it already?
>
> > +% CYRILLIC CAPITAL LETTER SHORT U
> > + ;""
> > [...]
> > +% CYRILLIC CAPITAL LETTER U
> > +
> > +% CYRILLIC UNDEFINED
> > + ;""
>
> This still makes me wonder.
>
> Does it work at all?
> What if we remove this rule, won't it be transliterated as
> => "U", - left unchanged, so "U" + "
> will eventually produce "Ú"?
> Why is it called "UNDEFINED"?
> Do we need similar rules for other characters?
>
> > [...]
> > +% CYRILLIC SMALL LETTER U
> > +
> > +% CYRILLIC UNDEFINED
> > + ;""
>
> Same here.
>
> > [...]
> > +% CYRILLIC SMALL LETTER YA
> > + ;""
>
> Again missing (because it is lowercase variant of ).
>
> > +% CYRILLIC SMALL LETTER IO
> > + ;""
> > [...]
> > +% CYRILLIC SMALL LETTER KJE
> > + ;""
>
> missing (same reason as ).
>
> > +% CYRILLIC SMALL LETTER SHORT U
> > + ;""
> > +% CYRILLIC SMALL LETTER DZHE
> > + "";""
>
> More letters missing here. Is this because they are historic so we
> don't want to include them now? Well, but "YUS" is also historic.
> (Please, do not remove YUS for consistency).
>
> > +% CYRILLIC CAPITAL LETTER BIG YUS
> > + ;""
> > +% CYRILLIC SMALL LETTER BIG YUS
> > + ;""
> > [...]
>
> I will continue but, again, I don't give any ETA so other reviewers
> are welcome here.
>
> Regards,
>
> Rafal
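
For readers following the "first option / second option" wording in the
quoted file header, here is a minimal Python sketch of that fallback idea:
each letter carries an ordered list of candidates, and the first one the
target charset can represent is used. This is not glibc or iconv code. The
table entries, the helper names (representable, translit) and the "?"
placeholder are all assumptions made for illustration, and the four sample
mappings are not copied from the patch.

```python
# Hypothetical sample table, NOT taken from the patch: each Cyrillic letter
# maps to an ordered list of candidates, a Latin-with-diacritics form first
# (ISO 9 / System A style) and a plain ASCII form second (System B style).
TRANSLIT = {
    "ж": ["ž", "zh"],
    "ш": ["š", "sh"],
    "ч": ["č", "ch"],
    "я": ["â", "ya"],
}


def representable(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit(text: str, charset: str) -> str:
    """Replace each unencodable character with its first encodable candidate."""
    out = []
    for ch in text:
        if representable(ch, charset):
            out.append(ch)
            continue
        for candidate in TRANSLIT.get(ch, []):
            if representable(candidate, charset):
                out.append(candidate)
                break
        else:
            # No rule, or no candidate fits; "?" is this sketch's own choice.
            out.append("?")
    return "".join(out)


print(translit("жшчя", "iso-8859-15"))  # žšchâ
print(translit("жшчя", "ascii"))        # zhshchya
```

In the first call "č" falls back to "ch" because ISO-8859-15 has no "č",
while "ž", "š" and "â" are kept. With ASCII as the target, every letter
falls through to its second candidate.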