Reply-To: Marko Myllynen
Subject: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
To: libc-alpha@sourceware.org, libc-locales@sourceware.org, Carlos O'Donell, Siddhesh Poyarekar, Rafal Luzynski
Cc: Mike Fabian, Egor Kobylkin
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <7cdd817a-4a47-201a-8eeb-87db324104b3@kobylkin.com>
From: Marko Myllynen
Message-ID: <8923a5a0-65c8-4784-6d7d-f3571933dcb5@redhat.com>
Date: Tue, 16 Apr 2019 10:15:33 +0300
In-Reply-To: <7cdd817a-4a47-201a-8eeb-87db324104b3@kobylkin.com>

Ping?

On 19/03/2019 12.39, Egor Kobylkin wrote:
> Changelog v12:
> * Adjusted to the new comment style that appeared in the target
> file locale/C-translit.h.in (the original file changed on the master
> branch from /* style to # style since v11)
> * Fixed a typo: CYRILLIC SMALL LETTER SHHA is now mapped to
> "sh`" instead of the erroneous "SH`" in v11
>
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly, the patch now only contains the additional
> Cyrillic-ASCII strings in the format of the locale/C-translit.h.in table.
> The 'include "translit_cyrillic";""' directives are not necessary in the
> locale files, which are now all left intact.
> * The file translit_cyrillic is no longer needed and is omitted.
> * Edited the email below and the commit message.
>
> Changelog v10:
> * Removed ISO 9:1995 / GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872
> * Edited the email below, the commit message, and the comment in
> translit_cyrillic to reflect the System A removal
> * Removed and (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics
>
> Changelog v9:
> * Fixed formatting (trailing spaces etc.)
> * Put the commit summary in the patch file; it is now generated completely
> by git format-patch
>
> Changelog v8:
> * Re-added translit_cyrillic, which was missing in patch v7 (due to a
> missing "git add" in the script).
>
> Changelog v7:
> * Generated against git://sourceware.org/git/glibc.git master with git
> format-patch.
> * The 'include "translit_cyrillic";""' now immediately follows the last
> 'include "translit_XXX";""' string (it was previously inserted just
> before translit_end).
> * Only the locales already having 'include .*translit.*;""' are patched
> (see the list of manual exclusions below; the full list of included
> locales is at the end of the email in the commit section).
> * Excluded az_AZ completely to avoid a circular reference from tr_TR via
> “copy "tr_TR"”.
>
> Changelog v6:
> * Locales removed from the patch: C and sd_PK.
> * Added locales: az_AZ and ky_KG.
> * Consistently transliterate single uppercase Cyrillic letters
> to sequences of all uppercase Latin letters in all languages (whenever
> a Cyrillic letter is transliterated to more than one Latin letter),
> for example "Ї" is now transliterated as "YI" rather than "Yi".
>
> Dear locale maintainers,
>
> this patch fixes glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
>
> by adding the Cyrillic transliteration rows to locale/C-translit.h.in.
>
> The patch is attached.
>
>
> Current bug effect:
>
> The glibc wiki explicitly lists this use case as the test example, and
> it currently fails on Cyrillic text [1] [8] [9]:
>
> iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt | grep CYRILLIC
>
> CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
>
> - it produces a string of question marks and spaces.
>
> This is what it should produce, and does produce once the patch is applied:
>
> CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
> chayu.
>
>
> The root problem and the fix:
>
> The root problem is the missing transliteration table that I am
> supplying here.
>
>
> COMMIT MESSAGE:
> This translit_cyrillic table enables conversion (e.g. with iconv) from
> UTF-8 encoded text based on the Cyrillic alphabet to ASCII//TRANSLIT text.
>
> Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an ASCII
> compatible transcription.
>
> While UTF-encoded Cyrillic text requires Cyrillic fonts, the result of
> a transliteration/transcription has only Latin/ASCII codes but can still
> be read by a native speaker. Among other things, it is useful for
> processing Cyrillic texts and filenames by programs or on systems
> that are not specifically prepared to work with Cyrillic, don't have
> corresponding fonts installed, or can't handle UTF-8.
>
> The patch content (mapping) is based on the ISO 9:1995 standard [10] and
> its derivative GOST 7.79-2000 System B official source (Federal Agency on
> Technical Regulating and Metrology of the Russian Federation [2]).
> Technically, an independent but mostly identical source [3] was used and
> prepared in a spreadsheet [6].
>
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes when converting Cyrillic
> to ASCII, and thus System B is chosen [11].
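
Note: the System B style mapping behind the sample output quoted above can be sketched in a few lines of Python. This is only an illustration, not the patch itself. The table is reconstructed from that single sample line, so it covers only the 33 letters of the modern Russian alphabet. The input sentence is assumed to be the well-known Russian pangram that the question-mark pattern suggests, and the names `SAMPLE_TABLE` and `transliterate` are hypothetical. Uppercase letters become all-uppercase sequences, following the v6 changelog entry.

```python
# Illustrative only: a partial table reconstructed from the sample output
# quoted in this thread, not the contents of locale/C-translit.h.in.
SAMPLE_TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "cz",
    "ч": "ch", "ш": "sh", "щ": "shh", "ъ": "``", "ы": "y'", "ь": "`",
    "э": "e`", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Map Cyrillic letters to ASCII using SAMPLE_TABLE (hypothetical helper)."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in SAMPLE_TABLE:
            latin = SAMPLE_TABLE[lower]
            # Uppercase input yields an all-uppercase sequence (v6 changelog rule).
            out.append(latin.upper() if ch != lower else latin)
        elif ch.isascii():
            out.append(ch)  # ASCII passes through unchanged
        else:
            out.append("?")  # unmapped non-ASCII, as in the bug report
    return "".join(out)


if __name__ == "__main__":
    # Assumed input: the pangram matching the question-mark pattern above.
    print(transliterate("Съешь ещё этих мягких французских булок, да выпей же чаю."))
```

Characters without a table entry come out as "?", which mirrors the failure shown in the bug report. The actual fix is the set of rows the patch adds to locale/C-translit.h.in.
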
> To be clear, System A has nothing to do with this bug, even though it
> is a transliteration.
>
> Those interested in implementing System A for transliteration of
> Cyrillic to Latin with diacritics as a new feature are welcome to use the
> spreadsheet in [6] as a starting point.
>
> Links:
>
> [1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
> [2] GOST 7.79-2000 official source
> http://protect.gost.ru/document.aspx?control=7&id=130715 (only
> available in low-quality GIF format)
> [3] http://transliteration.ru/gost-7-79-2000/ and
> http://www.yfermer.ru/specifications/285821.html
> [4] Wikipedia article on Cyrillic transliteration with Latin alphabet
> https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
> [5] http://man7.org/linux/man-pages/man5/locale.5.html
> [6] Spreadsheet for generating translit_cyrillic
> https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
> [8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
> [9] translit-test-input.txt
> https://sourceware.org/bugzilla/attachment.cgi?id=11304
> [10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
> [11]
> https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3
>
>
> Best regards,
> Egor Kobylkin
>

--
Marko Myllynen