From mboxrd@z Thu Jan  1 00:00:00 1970
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-e=80x24.org@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Reply-To: Marko Myllynen <myllynen@redhat.com>
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872] re-submission for 2.29
To: Egor Kobylkin <egor@kobylkin.com>,
 Rafal Luzynski <digitalfreak@lingonborough.com>,
 Keld Simonsen <keld@keldix.com>
Cc: libc-alpha@sourceware.org, libc-locales@sourceware.org,
 "Dmitry V. Levin" <ldv@altlinux.org>, Volodymyr Lisivka
 <vlisivka@gmail.com>, Carlos O'Donell <carlos@redhat.com>,
 Max Kutny <mkutny@gmail.com>, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <ac4c9b3e-aeae-30de-23ef-24d8f53d7bc4@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
From: Marko Myllynen <myllynen@redhat.com>
Message-ID: <bda2ca60-18f1-3b19-91e5-c9ad144bc834@redhat.com>
Date: Fri, 5 Oct 2018 14:54:10 +0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi,

Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
possible and if not, then fall back to GOST 7.79 System B?

Implementation-wise, the current translit_* files have a few examples
where a non-ASCII transliteration is tried first, before an ASCII
fallback. These examples are from translit_neutral:

% NARROW NO-BREAK SPACE
<U202F> <U00A0>;<U0020>
% REVERSED TRIPLE PRIME
<U2037> "<U2035><U2035><U2035>";"<U0060><U0060><U0060>"
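
As a rough illustration of how that ordering would play out for Cyrillic,
here is a small Python model of the "first candidate that fits the target
charset wins" rule. It is only a sketch, not glibc code: TRANSLIT,
encodable() and transliterate() are hypothetical names, the four table
entries are just a sample (System A first, System B second), and emitting
"?" when nothing fits is an assumption made for the example.

```python
# Hypothetical model of ordered fallback, in the spirit of translit_* entries
# such as  <U202F> <U00A0>;<U0020>.  None of these names are glibc APIs.

# Candidates per source character, in order of preference:
# ISO 9:1995 / GOST 7.79 System A first, GOST 7.79 System B (ASCII) second.
TRANSLIT = {
    "\u0436": ["\u017e", "zh"],   # CYRILLIC SMALL LETTER ZHE:   z-caron, then zh
    "\u0447": ["\u010d", "ch"],   # CYRILLIC SMALL LETTER CHE:   c-caron, then ch
    "\u0448": ["\u0161", "sh"],   # CYRILLIC SMALL LETTER SHA:   s-caron, then sh
    "\u0449": ["\u015d", "shh"],  # CYRILLIC SMALL LETTER SHCHA: s-circumflex, then shh
}


def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, charset):
    """Replace characters missing from charset by the first candidate that fits."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)          # representable as is, nothing to do
            continue
        for candidate in TRANSLIT.get(ch, []):
            if encodable(candidate, charset):
                out.append(candidate)
                break
        else:
            out.append("?")         # assumption: placeholder when nothing fits
    return "".join(out)
```

With this table, transliterate("\u0436\u0449", "ascii") gives "zhshh",
while transliterate("\u0436\u0449", "iso-8859-2") gives z-caron followed
by "shh": ISO-8859-2 has z-caron but no s-circumflex, so only the second
letter drops down to System B.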

Thanks,

On 2018-10-05 13:29, Egor Kobylkin wrote:
> Keld, Marko, Rafal, other locale maintainers,
> 
> all of this is written with a minimal viable fix for this bug in mind,
> to be delivered as soon as possible. I want to avoid spending
> maintainers' time on fundamental discussions here (even though they
> would be held for perfectly good reasons).
> 
> I see three options:
> 1. those locale maintainers that are fine with using the ISO
> 9:1995/GOST_7.79_System_B Cyrillic transliteration table (Ru) include it
> in their locales (see attached screenshot of the table).
> 2. those that want to have a differing table can create their own
> variant based on the spreadsheet I have prepared
> https://sourceware.org/bugzilla/attachment.cgi?id=8590 and include it in
> this patch.
> 3. those that want to omit a Cyrillic transliteration altogether for now
> state so and just carry over bug #2872 from the year 2006.
> 
> Does this make sense to you?
> 
> Just to be super clear on this: the patch is a stopgap _ASCII_
> transliteration table. ASCII being AMERICAN Standard Code for
> Information Interchange, that is obviously orthogonal to any
> transliteration rule of other countries. As such it is not explicitly
> targeting transliteration standards of any country.
> 
> The patch reflects the Russian variety of ISO
> 9:1995/GOST_7.79_System_B because a) ISO 9:1995/GOST_7.79_System_B is
> available and can be helpful to a majority of Cyrillic users, and b) I
> have access to it, not least because I am proficient in Russian.
> 
> It is offered to all the respective locale maintainers as a stopgap
> solution. Stopgap in the sense that it is better to have some
> transliteration than not to have any at all and carry over the bug from
> 2006. That it may be a somewhat officially correct transliteration for
> ru_RU is a bonus. In that sense I would dub the discussion on the
> correctness for other languages "offtopic". Let me know if this is not OK.
> 
> You are all correctly pointing out the deficiencies of this approach.
> However, I have not yet found a better straightforward approach.
> I am happy to hear from you on how this could be handled.
> 
> There is a danger of being caught in the web of language/country
> differences. I propose just pruning the locales that are not comfortable
> including this current table. We can address possible solutions in the
> second wave of patching.
> 
> I am wary of getting into discussions on specific country variants just
> because of the sheer complexity of this topic. It is probably better
> addressed by the respective maintainers of their locales. I do not see a
> "one size fits all" solution as possible in this first wave.
> 
> I would like to have this "three options plan of action" vetted first,
> and then we could go into the specific details (for instance, which
> characters should be included in the table, and in which
> transliteration form).
> 
> I am looking forward to your reply,
> Egor Kobylkin
> 
> P.S. Specifically, as to how to address languages other than Ru included
> in GOST_7.79_System_B: we can take the first option, left to right, from
> that table (Ru,By,Uk,Bg,Mk). Then it will technically work for all those
> locales/languages, but with errors where Ru supersedes their own variants.
> 
> 
> On 05.10.2018 11:20, Rafal Luzynski wrote:
>> 3.10.2018 11:32 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>
>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>> Hi
>>>>
>>>> Please note that transliteration of Cyrillic to Latin is not universal.
>>>> There are different schemes for, for example, German, English and Danish,
>>>> and there is also an ISO standard for it.
>>>
>>> Thanks for your feedback, Keld!
>>>
>>> Could the locale maintainers that wouldn't like to include this patch
>>> explicitly state so here?
>>
>> I think it is about me so I must reply.  I am sorry about that and the sole
>> reason is my lack of time.  I'm just a volunteer here, that means it's not
>> my regular job to work on locale data nor anything in glibc nor in any other
>> open source project.  I do these things only in my free time which I don't
>> have much.  Of course you will see my contributions here and there but they
>> are either trivial or take me months to complete.  Your patches are on my
>> radar but I can't tell any ETA for them.  Of course, there are other people
>> around here and they are all welcome to come and join.
>>
>>> That is:
>>> - In the case that there is a different preferred cyrillic
>>> transliteration table for any specific locale their maintainers may want
>>> to point me to it so I can supply a separate table/patch.
>>> - Or they could state explicitly that for some reason they would like to
>>> exclude their locale from the patch for a default cyrillic
>>> transliteration altogether.
>>
>> As Keld wrote, there are probably separate rules for every language so
>> I don't think you should treat your rules as universal and include them
>> in every locale.  At first sight, it seems to me they work only for English
>> (as a destination locale).  Also, although it is called "transliteration
>> from Cyrillic" it seems that it covers only Russian alphabet.  What about
>> other languages which use Cyrillic alphabet but add their own diacritic
>> characters?  Think about Belarusian, Ukrainian, Serbian, Chechen, Chuvash,
>> Mari, Ossetian, Yakut, Tatar, and more.  What about languages which use
>> Cyrillic alphabet but transliterate their respective letters in a different
>> way than Russian?  For example, Russian "Ъ" is (I think) usually skipped
>> in transliteration, I think you propose "``", but when transliterating from
>> Bulgarian they usually transliterate this as "ă".
>>
>> A few remarks:
>>
>> * I think you transliterate "щ" as "shh", wouldn't "shch" be better?
>> * You transliterate "ц" as "cz", wouldn't "ts" be better?  By the way,
>>   in Polish language "cz" is a correct transliteration of "ч".
>> * You transliterate "й" as "j", this is fine in many languages but wouldn't
>>   "y" be better in English?
>> * In case of "е": how will you know if it is correct to transliterate it
>>   to "e" or "ie" or "je" or "ye"?
>>
>> These remarks are obviously incomplete, your patch deserves much more
>> attention to review.
>>
>> Best regards,
>>
>> Rafal
>>
> 


-- 
Marko Myllynen