Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
To: Rafal Luzynski, libc-alpha@sourceware.org, libc-locales@sourceware.org, Marko Myllynen
From: Egor Kobylkin
Date: Sat, 17 Nov 2018 19:34:45 +0100

Hi Rafal,

thanks for putting the SH/Sh problem into a clear issue statement.
I'm totally with you that this is a good thing to discuss. It is
orthogonal to the tests, so let me focus on the SH/Sh and System A/B
problems here.

Looks like we have three issues:

1. There is no explicit control over which transformation to use
   (System A or System B) via //TRANSLIT.
2. System B output can collide if capital letters are transcribed in
   CAP/low form.
3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
   System B, because its equivalent 'X'/'x' from System A is always
   present and takes precedence.

As a solution, shouldn't we keep only System B, in a new file
transcribe_cyrillic, and put it in place as the explicit ASCII
transcription for the targeted locales (as opposed to transliteration)?
We would keep System A as translit_cyrillic but would not include it in
this patch. Once you have resolved the issue of having two conflicting
rule sets but only one key, //TRANSLIT, you could add System A back.

The SH/Sh question can be decided either way; it seems like an easy
change in any case.
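To make issue 2 concrete, here is a minimal Python sketch of the collision. The two tables are hypothetical and heavily trimmed; they are not the tables from the patch and cover only the letters of the two example words. CAP_LOW writes the capital "Ш" as "Sh", CAP_CAP writes it as "SH".

```python
# Hypothetical, trimmed rule tables for illustration only.
# They are NOT the tables from the patch.
CAP_LOW = {"Ш": "Sh", "С": "S", "х": "h", "е": "e", "м": "m", "а": "a"}
CAP_CAP = {**CAP_LOW, "Ш": "SH"}  # same rules, but the capital digraph stays uppercase


def transcribe(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


for word in ("Шема", "Схема"):
    print(word, transcribe(word, CAP_LOW), transcribe(word, CAP_CAP))
# Шема Shema SHema
# Схема Shema Shema
```

With the CAP/low table both words come out as "Shema"; with the CAP/CAP table they stay distinguishable.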
Please see more discussion of your excellent points below:

On 16.11.18 23:17, Rafal Luzynski wrote:
> Egor, while at this I was thinking about your idea to transliterate
> letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> "Sxema").

To clarify, this SH/Sh collision issue relates only to

  iconv -f UTF-8 -t ASCII//TRANSLIT

(i.e. System B transcription).

But it's not only SH/Sh. The following combinations are used to
transcribe capital letters:

  YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ

Arguably any of them, if not in that CAP/CAP form, could collide with
the identical CAP/low sequence produced by two separate letters in a
different word. (There may be grammar rules in the languages that in
fact prevent some of these collisions, but we don't know for sure.)

With transcription we are basically stripping information from the
data, mapping it into a smaller character set. The idea of keeping
them in CAP/CAP form is to preserve as much information as possible.

> Also you include a rule to transliterate "Х" to "H" or "X" depending
> on which destination characters are available, which I told you
> already that will not work because both "H" and "X" are always
> available and therefore only the first rule will always be used.

Just to have this here for reference, the idea was to have both rules
in one file, so that

  iconv -f UTF-8 -t ASCII//TRANSLIT

will produce an ASCII-compatible _transcription_ (System B), and

  iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv -f ISO-8859-15 -t UTF-8

will produce a Latin _transliteration_ as per ISO 9:1995 (System A).

So in fact we have two rules for each letter in the same file
(System A and System B), where System A takes precedence.

I have a question then: isn't this more of a hack than the right thing
to do? Shouldn't we have two explicit rules, one for transcription and
one for transliteration, that do not depend on the destination
character set?
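The "first rule wins" behaviour can be sketched in a few lines of Python. This is only a model of the selection as I understand it, not glibc code: each letter has an ordered list of candidates, and the first candidate that can be encoded in the destination charset is used. The RULES table is hypothetical, with the ordering for "х" following the description earlier in this mail.

```python
# Hypothetical candidate lists, preferred form first. Illustration only,
# not the rules from the patch.
RULES = {
    "ж": ["ž", "zh"],  # the diacritic form exists in ISO-8859-15 but not in ASCII
    "х": ["x", "h"],   # both candidates are plain ASCII
}


def translit(text, charset):
    """Pick, per character, the first candidate encodable in `charset`."""
    out = []
    for ch in text:
        for cand in RULES.get(ch, [ch]):
            try:
                cand.encode(charset)
            except UnicodeEncodeError:
                continue  # not representable, try the next candidate
            out.append(cand)
            break
        else:
            out.append("?")  # no candidate fits the destination charset
    return "".join(out)


print(translit("жх", "iso8859-15"))  # žx
print(translit("жх", "ascii"))       # zhx
```

For "ж" the destination charset really does select between the two forms. For "х" both candidates are always encodable, so the second one is never reached, whatever the destination.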
> I still don't like the idea to
> put two uppercase letters in a beginning of a word in titlecase only
> to indicate that there was originally a single letter. What if we:
>
> * drop the rule of transliterating "Х" to "H" and transliterate
> always to "X",

This would contradict ISO 9:1995 (System A). System A was added at
Marko's request (so I'm putting him on To:). Just to be clear, I am
neutral on keeping it or dropping it.

> * transliterate uppercase "Ш" to "Sh" (so it will work fine for
> titlecase words)?
>
> As a result the Latin letter "h" will only appear as part of a
> digraph and never as a transliteration of "Х" and therefore will
> never cause a conflict. Examples:
>
> * "Шема" -> "Shema",
> * "Схема" -> "Sxema".
>
> Will this solve the problem?

This particular h/x rule would make sense on its own. But again, it
would contradict the standards.

On the other hand, for my personal needs I care less about standards
than about current functionality and the data loss caused by
transcription being missing altogether due to BZ #2872.

Bests,
Egor
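For reference, a minimal Python sketch of the alternative quoted above ("Х" always to "X", uppercase "Ш" to "Sh"). The PROPOSED table is hypothetical and covers only the letters of the two example words.

```python
# Hypothetical, trimmed table for the quoted proposal. Illustration only.
PROPOSED = {
    "Ш": "Sh", "ш": "sh",  # "h" appears only inside a digraph
    "Х": "X", "х": "x",    # ha is always written as x
    "С": "S", "е": "e", "м": "m", "а": "a",
}


def transcribe(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


print(transcribe("Шема", PROPOSED))   # Shema
print(transcribe("Схема", PROPOSED))  # Sxema
```

The two words stay distinct without the CAP/CAP form, at the cost of departing from the standard for the letter ha.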