From mboxrd@z Thu Jan  1 00:00:00 1970
List-Id: <libc-alpha.sourceware.org>
Date: Sat, 8 Dec 2018 00:35:56 +0100 (CET)
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Egor Kobylkin <egor@kobylkin.com>, libc-alpha@sourceware.org,
	libc-locales@sourceware.org
Message-ID: <1718190635.706992.1544225756803@poczta.nazwa.pl>
In-Reply-To: <676c37bd-ba92-a7ed-019e-94974143233f@kobylkin.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <676c37bd-ba92-a7ed-019e-94974143233f@kobylkin.com>
Subject: Re: [PATCH v10] Locales: Cyrillic -> ASCII transliteration table
 [BZ #2872]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

19.11.2018 12:10 Egor Kobylkin <egor@kobylkin.com> wrote:
>
> Changelog v10:
> * Removed ISO 9.1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872

I'm in favor of implementing System A and dropping System B instead.
If I understand correctly, System A is actually ISO 9, therefore it is
international, universal, and neutral, while System B is a GOST standard
and therefore used mainly in Russia (although it has also been adopted in
several other countries).

It's true that we can't handle both System A and System B.  What we
would like to have is:


             System A
          /============> OUTPUT: Latin with diacritics
  INPUT <    System B
          \============> OUTPUT: Plain ASCII (fallback)

That means: use one system but if the output can't handle it then switch
to another system.

But what we can actually have is either:

          System A                                    Fallback
  INPUT ============> OUTPUT: Latin with diacritics ============> Plain ASCII

or:

          System B
  INPUT ============> OUTPUT: Plain ASCII

That means, we can only provide a fallback for individual characters,
we can't provide a fallback algorithm (that is, we can't switch to
transliterating 'Х' as 'X' instead of 'H' just because we can't
transliterate 'Ш' as 'Š' and switch to 'SH' instead).
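
To illustrate the difference, here is a minimal Python sketch.  It is
not how glibc implements transliteration; the table names, the helper
functions and the three sample entries are all hypothetical and only
model the two behaviours described above: a per-character fallback
versus switching the whole system.

```python
# Hypothetical, tiny excerpts of the two tables (NOT the real glibc data).
SYSTEM_A = {
    "\u0425": "H",        # CYRILLIC CAPITAL LETTER HA
    "\u0428": "\u0160",   # CYRILLIC CAPITAL LETTER SHA -> S WITH CARON
}
SYSTEM_B = {
    "\u0425": "X",
    "\u0428": "SH",
}
# Assumed per-character fallback for System A results that the output
# character set cannot represent.
ASCII_FALLBACK = {"\u0160": "SH"}


def encodable(s, charset):
    """Return True if s can be represented in the given charset."""
    try:
        s.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def per_char_fallback(text, charset):
    """What we can have: System A, then a fallback for each character."""
    out = []
    for ch in text:
        candidate = SYSTEM_A.get(ch, ch)
        if not encodable(candidate, charset):
            candidate = ASCII_FALLBACK.get(candidate, "?")
        out.append(candidate)
    return "".join(out)


def whole_system_switch(text, charset):
    """What we would like: use System A, else redo everything in System B."""
    result = "".join(SYSTEM_A.get(ch, ch) for ch in text)
    if encodable(result, charset):
        return result
    return "".join(SYSTEM_B.get(ch, ch) for ch in text)
```

With an ASCII output, the per-character variant still emits 'H' for
the first letter and only replaces the second one, while the
whole-system variant would emit 'X' for the first letter as System B
requires.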

Wouldn't it be better to implement ISO 9 (System A) instead and provide
a fallback ASCII transliteration which could be similar but not identical
to System B?  Is it necessary to provide plain ASCII transliteration
conforming to System B even if that means that we would not be able to
implement System A?  If yes, would it be correct to provide System B
for ru_RU (and maybe a few more locales) but include System A in all other
locales (except the few which we exclude already)?

> * Edited below email, commit message, comment in translit_cyrillic to
> reflect System A removal
> * Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics

OK, thank you, I like this change.

> [...]
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes converting Cyrillic to
> ASCII and thus the System B is chosen. [11]

I'm not sure it should actually be called transcription.  IIUC,
transcription reflects pronunciation, something we can't easily
implement in glibc.  As long as we convert letters to letters (or groups
of letters to groups of letters) without taking pronunciation into
account it should be called transliteration.  OTOH, I agree that it is
rather uncommon in the Russian language to find an example where
pronunciation is not perfectly reflected in spelling.

> +% Generated from UnicodeData.txt with a spreadsheet referenced
> +% in that bugs doclet

The previous versions of your patch had "in that bug's doclet" here
which I think is correct.

I like version 9 of your patch more, so I'm going to write a more
thorough review of it.

Regards,

Rafal