Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
To: libc-alpha@sourceware.org, libc-locales@sourceware.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <837001401.21346.1542406647888@poczta.nazwa.pl> <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com> <1441622134.517912.1543702039942@poczta.nazwa.pl>
From: Egor Kobylkin
Cc: Marko Myllynen
Message-ID: <2f6fc82c-77ba-d331-ae5d-e2373e122a88@kobylkin.com>
Date: Mon, 3 Dec 2018 23:19:03 +0100
In-Reply-To: <1441622134.517912.1543702039942@poczta.nazwa.pl>

Rafal,

Just to touch base on this, what is the best way forward? Did you get any input/feedback on your questions below? Are you expecting input from anyone but myself?

On the blocking issue #2: I really don’t see the connection to the uk_UA locale, which has its transliteration table inline and is explicitly excluded from my patch. It may be revealing another issue you have with glibc, but wouldn’t that be better addressed in a new bug? Again, in v10 of my patch I have removed multicharacter source graphemes, so that issue is moot there.

If you’d like to overhaul the glibc translit system, wouldn’t it be better to commit the simple text file with the Cyrillic translit (transcription) table first, fix the bug from 2006, and then proceed from there with all due diligence?

The same goes for having both System A and System B. Initially I went along with the suggestion to include System A, but it is clear now that it doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose to set it aside for the moment and use v10 without System A.
That is the whole reason I have submitted it, to be super clear on that.

Now you saw that uconv is transcribing «ХА» as KHA (cap/cap/cap); that should mitigate your concern about that issue too (somewhat, anyway). Making it context based would also be about adding new code, see above.

Let me know if there’s anything I can help with to make more progress on the decision.

Bests,
Egor

On 16.11.18 23:17, Rafal Luzynski wrote:
> 2. I made a few tests in the command line and it seems to me that the
> transliteration from "З" to "Z" (+ lowercase as well) in uk_UA does
> not work and has not been working for some time already because I've
> checked some older systems as well and the result is always the same.
> I think that the reason is that uk_UA defines multiple
> transliteration rules for "З" depending on what is the letter
> following it. It does not seem to work. AFAIK the reason is that
> the syntax of transliteration rules says that a single non-Latin
> character may map to one or more Latin strings, each consisting of one
> or more characters. There cannot be a rule transliterating multiple
> source characters into one or multiple destination characters. Is it
> a bug in the transliteration implementation? Or maybe in the
> specification, including the POSIX standard?
> The definition of transliteration says that it is a one-to-one mapping
> of graphemes while a grapheme may be one or multiple characters. It
> does not always have to be a one-to-one mapping of characters. Should we
> fix this bug first, make uk_UA transliteration work, and only then
> add a generic Cyrillic transliteration? Egor's patch already
> contains transliteration of "У" + combining acute accent to "Ú" which
> most probably will not work.
>
> I still think that in the longer term all existing custom
> transliterations of Cyrillic alphabets should be ported to a
> modification of your patch.

On 01.12.18 23:07, Rafal Luzynski wrote:
> 19.11.2018 08:13 Marko Myllynen wrote:
>> [...]
>> Given the amount of questions above I think the way forward is to try
>> to follow the relevant standards as closely as possible and also check what
>> the other implementations (i.e., uconv(1)) do. For example, checking the
>> case mentioned earlier may or may not give some hints:
>>
>> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Šema
>> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Shema
>> $ uconv -V
>> uconv v2.1 ICU 50.1.2
>
> I've played a little with uconv and unfortunately it does not look good
> to me.
>
> It does not have any fallback transliteration to plain ASCII. When it says
> that 'Ш' is transliterated to 'Š' then it always uses 'Š' and if the target
> charset does not have this character then the conversion fails:
>
> $ echo Шема | uconv -f UTF-8 -t ASCII -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
> �ema
> $ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f ISO-8859-2 -t UTF-8
> Šema
>
> It seems to follow ISO 9 (GOST 7.79) System A.
> However, the transliteration
> of the hard sign is rather strange:
>
> $ echo нъе | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> nʺe
>
> The above was correct but:
>
> $ echo НЪЕ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Nʺ̱E
> $ echo Ъ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> ʺ̱
> $ echo Ъ | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
> 0000000 feff 02ba 0331 000a
> 0000008
>
> So this generates:
> 02BA MODIFIER LETTER DOUBLE PRIME
> 0331 COMBINING MACRON BELOW
>
> There are more transliteration methods, for example Russian-Latin/BGN:
>
> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Shema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Skhema
>
> Converting 'х' to 'kh' seems to be common in English transliteration but
> it does not follow any ISO standard.
>
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
>
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement. But definitely a good thing.
>
> Two more tests:
>
> $ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Yeshchë
> $ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> Conversion from Unicode to codepage failed at output byte position 6.
> Unicode: 00eb Error: Invalid character found
>
> So the output is not plain ASCII.
>
> $ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> ye zhe le ne
>
> Again this means that transliteration of 'е' is context based:
> it is 'ye' at the beginning of a word and 'e' otherwise.
>
> The version which I've tested:
>
> $ uconv -V
> uconv v2.1 ICU 60.2
>
> It seems that uconv will not be a good hint about transliterating
> to plain ASCII.
>
> Also, the difference between uconv and iconv is that with iconv we can
> provide multiple transliterations for any source character but we can't
> group them into standards, so we can't tell iconv to use this or another
> system. It will just choose the one best fitting the current output
> character set and the only thing we can choose is the locale.
>
> This makes me think: should we add a locale like ru_RU@SystemA or
> ru_RU@SystemB?
>
> Regards,
>
> Rafal
>
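
To make the "context based" behaviour discussed above concrete, here is a minimal Python sketch of what a per-character table alone cannot express: the case of an output digraph depends on the neighbouring source letters, and 'е' maps differently at the start of a word. The tables and the names DIGRAPHS, SINGLE, translit_word and transliterate are hypothetical and cover only the letters used in the examples from this thread. This is not the glibc patch, nor ICU's Russian-Latin/BGN rule set, whose real rules are more involved.

```python
# Illustrative, partial tables (assumption: only letters from the thread's examples).
DIGRAPHS = {"х": "kh", "ш": "sh", "ж": "zh"}
SINGLE = {"а": "a", "е": "e", "л": "l", "м": "m", "н": "n", "с": "s"}


def translit_word(word: str) -> str:
    """Transliterate one word, choosing output case from the source context."""
    out = []
    for i, ch in enumerate(word):
        low = ch.lower()
        if low == "е" and i == 0:
            # Context rule 1: word-initial 'е' becomes 'ye' (simplified).
            latin = "ye"
        else:
            # Unknown characters are passed through unchanged.
            latin = DIGRAPHS.get(low) or SINGLE.get(low, ch)
        if ch.isupper():
            prev_upper = i > 0 and word[i - 1].isupper()
            next_upper = i + 1 < len(word) and word[i + 1].isupper()
            # Context rule 2: an uppercase neighbour means an all-caps run,
            # so the whole digraph is uppercased; otherwise only its first letter.
            latin = latin.upper() if (prev_upper or next_upper) else latin.capitalize()
        out.append(latin)
    return "".join(out)


def transliterate(text: str) -> str:
    """Apply translit_word to each space-separated word."""
    return " ".join(translit_word(w) for w in text.split(" "))


if __name__ == "__main__":
    for sample in ("ХА ха", "Ха", "Схема", "е же ле не"):
        print(sample, "->", transliterate(sample))
```

Both rules need to look at neighbouring characters, which is why they would require new code rather than only a new table of single-character mappings.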