Date: Sat, 8 Dec 2018 00:35:56 +0100 (CET)
From: Rafal Luzynski
To: Egor Kobylkin, libc-alpha@sourceware.org, libc-locales@sourceware.org
Message-ID: <1718190635.706992.1544225756803@poczta.nazwa.pl>
In-Reply-To: <676c37bd-ba92-a7ed-019e-94974143233f@kobylkin.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <676c37bd-ba92-a7ed-019e-94974143233f@kobylkin.com>
Subject: Re: [PATCH v10] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

19.11.2018 12:10 Egor Kobylkin wrote:
>
> Changelog v10:
> * Removed ISO 9.1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872

I'm in favor of implementing System A and dropping System B instead.
If I understand correctly, System A is actually ISO 9, therefore it is
international, universal, and neutral, while System B is a GOST standard
and therefore used mainly in Russia (although it has been adopted in
several other countries as well).

It's true that we can't handle both System A and System B. What we
would like to have is:

         System A
       /============> OUTPUT: Latin with diacritics
INPUT <
         System B
       \============> OUTPUT: Plain ASCII (fallback)

That means: use one system, but if the output can't handle it then switch
to another system.
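
As a rough illustration of the difference between switching systems and
falling back letter by letter, here is a Python sketch. It is not glibc
code: the two-letter tables and the function names are made up for this
example, and real locale data covers far more characters.

```python
# Illustrative two-letter tables only; NOT the glibc locale data.
SYSTEM_A = {"Х": "H", "Ш": "Š"}   # ISO 9 style, needs diacritics
SYSTEM_B = {"Х": "X", "Ш": "SH"}  # plain ASCII


def encodable(text, charset):
    """Return True if every character of text fits in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit_whole_system(text, charset):
    """Wished-for behaviour: use System A if the whole result fits the
    output charset, otherwise redo the whole text with System B."""
    out_a = "".join(SYSTEM_A.get(ch, ch) for ch in text)
    if encodable(out_a, charset):
        return out_a
    return "".join(SYSTEM_B.get(ch, ch) for ch in text)


def translit_per_char(text, charset):
    """Per-character fallback: every letter tries System A first and
    System B second, independently of its neighbours."""
    out = []
    for ch in text:
        for table in (SYSTEM_A, SYSTEM_B):
            candidate = table.get(ch, ch)
            if encodable(candidate, charset):
                out.append(candidate)
                break
        else:
            out.append("?")  # nothing fits the output charset
    return "".join(out)


print(translit_whole_system("ШХ", "ascii"))  # SHX (pure System B)
print(translit_per_char("ШХ", "ascii"))      # SHH (mix of B and A)
```

With a UTF-8 output both functions return "ŠH"; they differ only when
the output charset can't hold the diacritics.
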
But what we can actually have is either:

         System A                                  Fallback
INPUT ============> OUTPUT: Latin with diacritics ============> Plain ASCII

or:

         System B
INPUT ============> OUTPUT: Plain ASCII

That means we can only provide a fallback for individual characters;
we can't provide a fallback algorithm (that is, we can't switch to
transliterating 'Х' as 'X' instead of 'H' just because we can't
transliterate 'Ш' as 'Š' and have to use 'SH' instead).

Wouldn't it be better to implement ISO 9 (System A) instead and provide
a fallback ASCII transliteration which could be similar but not identical
to System B? Is it necessary to provide a plain ASCII transliteration
conforming to System B even if that means that we could not implement
System A? If yes, would it be correct to provide System B for ru_RU
(and maybe a few more locales) but include System A in all other locales
(except the few which we already exclude)?

> * Edited below email, commit message, comment in translit_cyrillic to
> reflect System A removal
> * Removed and (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics

OK, thank you, I like this change.

> [...]
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes converting Cyrillic to
> ASCII and thus the System B is chosen. [11]

I'm not sure it should actually be called transcription. IIUC,
transcription reflects pronunciation, something we can't easily
implement in glibc. As long as we convert letters to letters (or groups
of letters to groups of letters) without taking pronunciation into
account, it should be called transliteration.
OTOH, I agree that it is rather uncommon in the Russian language to find
an example where pronunciation is not perfectly reflected in spelling.

> +% Generated from UnicodeData.txt with a spreadsheet referenced
> +% in that bugs doclet

The previous versions of your patch had "in that bug's doclet" here,
which I think is correct.

I like version 9 of your patch more, so I'm going to write a more
thorough review of it.

Regards,

Rafal