Date: Fri, 16 Nov 2018 23:17:27 +0100 (CET)
From: Rafal Luzynski
To: Egor Kobylkin, libc-alpha@sourceware.org, libc-locales@sourceware.org
Message-ID: <837001401.21346.1542406647888@poczta.nazwa.pl>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org>
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

Thank you for working on this, Egor. Before I start reviewing, I would like to summarize the issues that I think block this patch.

1. I think we need tests for transliteration. Currently there is only one test program similar to what we need, localedata/bug-iconv-trans.c. It is old, and it is not clear which bug it is meant to test. Therefore I think we need a new framework for testing transliteration. Would it be a good idea to base the tests on the iconv(1) command-line utility, which is part of glibc?

2. I ran a few tests on the command line, and it seems that the transliteration of "З" to "Z" (and of the lowercase letter) in uk_UA does not work. It has not worked for some time: I checked some older systems as well and the result is always the same. I think the reason is that uk_UA defines multiple transliteration rules for "З", depending on the letter that follows it, and that does not seem to work. AFAIK the syntax of transliteration rules says that a single non-Latin character may map to one or more Latin strings, each consisting of one or more characters. There cannot be a rule transliterating multiple source characters into one or more destination characters. Is this a bug in the transliteration implementation? Or maybe in the specification, including the POSIX standard?
The definition of transliteration says that it is a one-to-one mapping of graphemes, where a grapheme may be one or more characters. It does not always have to be a one-to-one mapping of characters. Should we fix this bug first, make uk_UA transliteration work, and only then add a generic Cyrillic transliteration? Egor's patch already contains a transliteration of "У" + combining acute accent to "Ú", which most probably will not work.

I still think that in the longer term all existing custom transliterations of Cyrillic alphabets should be ported to a modification of your patch.

Egor, while at it, I was thinking about your idea to transliterate letters like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema"). You also include a rule to transliterate "Х" to "H" or "X" depending on which destination characters are available. As I told you already, that will not work: both "H" and "X" are always available, so only the first rule will ever be used. I still don't like the idea of putting two uppercase letters at the beginning of a titlecase word only to indicate that the original was a single letter. What if we:

* drop the rule transliterating "Х" to "H" and always transliterate it to "X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase words)?

As a result, the Latin letter "h" will appear only as part of a digraph and never as a transliteration of "Х", and therefore will never cause a conflict. Examples:

* "Шема" -> "Shema",
* "Схема" -> "Sxema".

Will this solve the problem?

Regards,

Rafal
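
To make the comparison of the two schemes for "Ш" and "Х" concrete, here is a minimal Python sketch. It is not glibc code and it does not parse locale sources. It only models the rule semantics as described above: one source character maps to an ordered list of candidate strings, and the first candidate representable in the target charset is used. The names first_representable, transliterate, PATCH_STYLE and PROPOSED are hypothetical, and the tables cover only the letters needed for the two example words.

```python
# Simplified model (an assumption, not glibc's implementation): each source
# character has an ordered list of candidate replacements, and the first
# candidate that can be encoded in the target charset is chosen.

def first_representable(candidates, charset="ascii"):
    """Return the first candidate encodable in charset, or None."""
    for cand in candidates:
        try:
            cand.encode(charset)
        except UnicodeEncodeError:
            continue
        return cand
    return None


def transliterate(text, rules, charset="ascii"):
    """Transliterate text one character at a time (no multi-character rules)."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)  # already representable, keep as-is
            continue
        except UnicodeEncodeError:
            pass
        repl = first_representable(rules.get(ch, ()), charset)
        out.append(repl if repl is not None else "?")
    return "".join(out)


# Letters shared by both schemes (only what the examples need).
BASE = {
    "\u0421": ["S"],   # С
    "\u0441": ["s"],   # с
    "\u0435": ["e"],   # е
    "\u043c": ["m"],   # м
    "\u0430": ["a"],   # а
    "\u0448": ["sh"],  # ш
}

# Scheme as described for the patch: Ш -> "SH", Х -> "H" or "X".
PATCH_STYLE = {
    **BASE,
    "\u0428": ["SH"],      # Ш
    "\u0425": ["H", "X"],  # Х
    "\u0445": ["h", "x"],  # х
}

# Scheme proposed in this message: Ш -> "Sh", Х -> "X" only.
PROPOSED = {
    **BASE,
    "\u0428": ["Sh"],  # Ш
    "\u0425": ["X"],   # Х
    "\u0445": ["x"],   # х
}

SHEMA = "\u0428\u0435\u043c\u0430"         # Шема
SKHEMA = "\u0421\u0445\u0435\u043c\u0430"  # Схема

if __name__ == "__main__":
    for name, rules in (("patch-style", PATCH_STYLE), ("proposed", PROPOSED)):
        for word in (SHEMA, SKHEMA):
            print(name, word, "->", transliterate(word, rules))
```

In this model the second candidate for "Х" is never reached when the target is ASCII, because "H" is always encodable. With the proposed table the two example words come out as "Shema" and "Sxema", so they stay distinct without a doubled uppercase letter.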