Date: Fri, 16 Nov 2018 23:17:27 +0100 (CET)
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Egor Kobylkin <egor@kobylkin.com>, libc-alpha@sourceware.org,
	libc-locales@sourceware.org
Message-ID: <837001401.21346.1542406647888@poczta.nazwa.pl>
In-Reply-To: <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Thank you for working on this, Egor.

Before I start reviewing I would like to summarize the things which
I think are blocking for this patch.

1. I think we need tests for transliteration.  Currently there is only
   one test program which is similar to what we need,
   localedata/bug-iconv-trans.c.  It is old and it is not quite clear
   what bug it is trying to test.  Therefore I think we need a new
   framework to test transliteration.  Is it a good idea to base the
   test on the iconv(1) command line utility which is part of glibc?

2. I ran a few tests on the command line and it seems to me that the
   transliteration from "З" to "Z" (and the lowercase as well) in uk_UA
   does not work, and has not worked for some time: I've checked some
   older systems as well and the result is always the same.  I think
   the reason is that uk_UA defines multiple transliteration rules for
   "З" depending on which letter follows it, and that does not seem to
   work.  AFAIK the syntax of transliteration rules says that a single
   non-Latin character may map to one or more Latin strings, each
   consisting of one or more characters.  There cannot be a rule
   transliterating multiple source characters into one or multiple
   destination characters.  Is it a bug in the transliteration
   implementation?  Or maybe in the specification, including the POSIX
   standard?  The definition of transliteration says that it is a
   one-to-one mapping of graphemes, while a grapheme may be one or
   multiple characters.  It does not always have to be a
   character-to-character mapping.  Should we fix this bug first, make
   uk_UA transliteration work, and only then add a generic Cyrillic
   transliteration?  Egor's patch already contains a transliteration of
   "У" + combining acute accent to "Ú", which most probably will not
   work.
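
To make point 1 more concrete, here is a rough sketch of what a single
transliteration check could look like when written against iconv(3)
rather than the iconv(1) utility.  The helper names translit_to_ascii
and check_translit are made up for this example and nothing like them
exists in the tree; it only assumes that "ASCII//TRANSLIT" is accepted
as a target, as it is in glibc.

```c
#include <iconv.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the UTF-8 string IN to ASCII with
   transliteration enabled.  Returns 0 on success, -1 on failure.
   The result depends on the locale currently in effect.  */
int
translit_to_ascii (const char *in, char *out, size_t outsize)
{
  char *inp = (char *) in;      /* iconv does not modify the input.  */
  size_t inleft = strlen (in);
  char *outp = out;
  size_t outleft;
  size_t rc;
  iconv_t cd;

  if (outsize == 0)
    return -1;
  outleft = outsize - 1;        /* Reserve room for the terminator.  */

  cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (rc != (size_t) -1)
    /* Flush any pending state.  */
    rc = iconv (cd, NULL, NULL, &outp, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return -1;
  *outp = '\0';
  return 0;
}

/* Hypothetical helper: returns 1 if IN transliterates to EXPECTED in
   LOCALE, 0 if it does not, and -1 if LOCALE is not available.  */
int
check_translit (const char *locale, const char *in, const char *expected)
{
  char buf[256];

  if (setlocale (LC_ALL, locale) == NULL)
    return -1;
  if (translit_to_ascii (in, buf, sizeof buf) != 0)
    return 0;
  return strcmp (buf, expected) == 0;
}
```

For example, check_translit ("uk_UA.UTF-8", "\xD0\x97", "Z") would
express the "З" -> "Z" expectation from point 2, provided that the
uk_UA.UTF-8 locale is installed on the test machine.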

I still think that in the longer term all existing custom transliterations
of Cyrillic alphabets should be ported to a modification of your patch.

Egor, while at it, I was thinking about your idea to transliterate letters
like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish
between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema").  You also
include a rule to transliterate "Х" to "H" or "X" depending on which
destination characters are available.  As I told you already, that will
not work: both "H" and "X" are always available, so only the first rule
will ever be used.  I still don't like the idea of putting two uppercase
letters at the beginning of a titlecase word only to indicate that there
was originally a single letter.  What if we:

* drop the rule transliterating "Х" to "H" and always transliterate it
  to "X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase
  words)?

As a result the Latin letter "h" will only appear as part of a digraph,
never as a transliteration of "Х", and therefore will never cause a
conflict.
Examples:

* "Шема" -> "Shema",
* "Схема" -> "Sxema".
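
The following toy program is only meant to illustrate the argument; it
is not how glibc implements transliteration.  Every rule carries an
ordered list of candidates and the first one representable in ASCII is
used.  The names (struct rule, translit, fallback_rules, proposed_rules)
are invented for this sketch, and the rules for "С", "е", "м" and "а"
are assumptions added only so that the two example words can be
converted completely.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CANDIDATES 3
#define COUNT(t) (sizeof (t) / sizeof (t)[0])

struct rule
{
  const char *src;                         /* One UTF-8 character.  */
  const char *candidates[MAX_CANDIDATES];  /* Ordered, NULL-terminated.  */
};

/* "H" or "X" for U+0425/U+0445, with "Sh" for U+0428.  */
const struct rule fallback_rules[] = {
  { "\xD0\xA8", { "Sh", NULL } },      /* U+0428 Ш */
  { "\xD0\xA5", { "H", "X", NULL } },  /* U+0425 Х */
  { "\xD1\x85", { "h", "x", NULL } },  /* U+0445 х */
  { "\xD0\xA1", { "S", NULL } },       /* U+0421 С (assumed) */
  { "\xD0\xB5", { "e", NULL } },       /* U+0435 е (assumed) */
  { "\xD0\xBC", { "m", NULL } },       /* U+043C м (assumed) */
  { "\xD0\xB0", { "a", NULL } },       /* U+0430 а (assumed) */
};

/* The proposal: always "X"/"x", and "Sh" for U+0428.  */
const struct rule proposed_rules[] = {
  { "\xD0\xA8", { "Sh", NULL } },
  { "\xD0\xA5", { "X", NULL } },
  { "\xD1\x85", { "x", NULL } },
  { "\xD0\xA1", { "S", NULL } },
  { "\xD0\xB5", { "e", NULL } },
  { "\xD0\xBC", { "m", NULL } },
  { "\xD0\xB0", { "a", NULL } },
};

static int
is_ascii (const char *s)
{
  for (; *s != '\0'; s++)
    if ((unsigned char) *s >= 0x80)
      return 0;
  return 1;
}

/* Pick the first candidate which the ASCII target can represent.  */
static const char *
first_representable (const struct rule *r)
{
  for (size_t i = 0; i < MAX_CANDIDATES && r->candidates[i] != NULL; i++)
    if (is_ascii (r->candidates[i]))
      return r->candidates[i];
  return "?";
}

/* Transliterate the UTF-8 string IN into OUT using RULES.  Output that
   does not fit into OUTSIZE bytes is silently truncated.  */
void
translit (const struct rule *rules, size_t nrules,
          const char *in, char *out, size_t outsize)
{
  size_t used = 0;

  if (outsize == 0)
    return;
  while (*in != '\0')
    {
      const char *repl = NULL;
      size_t consumed = 1;
      char ascii[2] = { 0, 0 };

      for (size_t i = 0; i < nrules && repl == NULL; i++)
        {
          size_t n = strlen (rules[i].src);
          if (strncmp (in, rules[i].src, n) == 0)
            {
              repl = first_representable (&rules[i]);
              consumed = n;
            }
        }
      if (repl == NULL && (unsigned char) *in < 0x80)
        {
          ascii[0] = *in;       /* Plain ASCII is copied unchanged.  */
          repl = ascii;
        }
      else if (repl == NULL)
        {
          repl = "?";           /* No rule: skip the whole character.  */
          while (((unsigned char) in[consumed] & 0xC0) == 0x80)
            consumed++;
        }

      size_t len = strlen (repl);
      if (used + len >= outsize)
        break;
      memcpy (out + used, repl, len);
      used += len;
      in += consumed;
    }
  out[used] = '\0';
}
```

With fallback_rules both "Шема" and "Схема" come out as "Shema",
because "h" is always representable and the "x" candidate is never
reached.  With proposed_rules they come out as "Shema" and "Sxema"
respectively.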

Will this solve the problem?

Regards,

Rafal