Changelog v11:
* Re-targeted the patch against locale/C-translit.h.in as the proper
  file for the ASCII translit table.
* Correspondingly, the patch now only contains the additional
  Cyrillic-ASCII strings in the format of the locale/C-translit.h.in
  table. The 'include "translit_cyrillic";""' directives are not
  necessary in the locale files, which are now all left intact.
* The file translit_cyrillic is no longer needed and is omitted.
* Edited below email, commit message.

Changelog v10:
* Removed ISO 9:1995 / GOST 7.79-2000 System A (transliteration to
  Latin with diacritics) as conflicting with System B within glibc
  mechanics and not solving BZ #2872.
* Edited below email, commit message, comment in translit_cyrillic to
  reflect System A removal.
* Removed the Cyrillic U with acute entries (formed using composition),
  as composing is not covered by current glibc conversion mechanics.

Changelog v9:
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file; it is now generated completely
  by git format-patch.

Changelog v8:
* Re-added translit_cyrillic, which was missing in patch v7 (due to a
  missing "git add" in the script).

Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with
  git format-patch.
* The 'include "translit_cyrillic";""' now immediately follows the last
  'include "translit_XXX";""' string (previously it was inserted just
  before translit_end).
* Only the locales already having 'include .*translit.*;""' are patched
  (see the list of manual exclusions below; the full list of included
  locales is at the end of the email in the commit section).
* Excluded az_AZ completely to avoid a circular reference from tr_TR
  via “copy "tr_TR"”.

Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters to
  sequences of all-uppercase Latin letters in all languages (whenever a
  Cyrillic letter is transliterated to more than one Latin letter); for
  example, "Ї" is now transliterated as "YI" rather than "Yi".

Dear locale maintainers,

To fix glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
please add the Cyrillic transliteration rows to locale/C-translit.h.in.
The patch is attached.

Current bug effect:
The glibc wiki explicitly lists this use case as the test example, and
it currently fails on Cyrillic texts [1] [8] [9]:

  iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt | grep CYRILLIC
  CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

That is, it produces a string of question marks and spaces. This is what
it should produce, and what it does produce once the patch is applied:

  CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.

The root problem and the fix:
The root problem is the missing transliteration table that I am
supplying here.

COMMIT MESSAGE:

This translit_cyrillic table enables conversion (e.g. with iconv) from
UTF-8 encoded text based on the Cyrillic alphabet to ASCII//TRANSLIT
text.
Example: iconv -f UTF-8 -t ASCII//TRANSLIT
will produce an ASCII-compatible transcription.

While UTF-encoded Cyrillic text requires Cyrillic fonts, the result of a
transliteration/transcription has only Latin/ASCII codes but can still
be read by a native speaker. Among other things, it is useful for
processing Cyrillic texts and filenames by programs or on systems that
are not specifically prepared to work with Cyrillic, don't have the
corresponding fonts installed, or can't handle UTF-8.

The patch content (mapping) is based on the ISO 9:1995 standard [10] and
its derivative GOST 7.79-2000 System B, whose official source is the
Federal Agency on Technical Regulating and Metrology of the Russian
Federation [2].
Technically, an independent but mostly identical source [3] was used and
prepared in a spreadsheet [6].

The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
System B is what is actually called transcription (preserving phonemes),
while System A is transliteration proper (preserving graphemes). There
is no meaningful way to preserve graphemes when converting Cyrillic to
ASCII, and thus System B is chosen [11].

To be super clear: System A has nothing to do with this bug, even though
it is a transliteration. Those interested in implementing System A for
transliteration of Cyrillic to Latin with diacritics as a new feature
are welcome to use the spreadsheet in [6] as a starting point.

Links:
[1] This bug entry
    https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source (only available in low-quality GIF
    format)
    http://protect.gost.ru/document.aspx?control=7&id=130715
[3] http://transliteration.ru/gost-7-79-2000/ and
    http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with the Latin alphabet
    https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
    https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
    https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
[11] https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3

Best regards,
Egor Kobylkin
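
For anyone who wants to sanity-check the expected output quoted in this
email without rebuilding glibc, a minimal Python sketch follows. It is an
illustration only and does not go through iconv or glibc. The dictionary
is a hypothetical subset inferred from the expected output line and from
the "Ї" -> "YI" note in Changelog v6; it is not the table from the patch.
The names SYSTEM_B_SUBSET and translit are made up for this example, and
the input sentence is assumed to be the Russian pangram behind the
CYRILLIC test line (the pattern of question marks matches its word
lengths).

```python
# Hypothetical subset of a GOST 7.79-2000 System B style mapping,
# inferred from the expected output in this email. NOT the patch table.
SYSTEM_B_SUBSET = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "cz",
    "ч": "ch", "ш": "sh", "щ": "shh", "ъ": "``", "ы": "y'", "ь": "`",
    "э": "e`", "ю": "yu", "я": "ya",
    "ї": "yi",  # assumed lowercase counterpart of the "Ї" -> "YI" example
}


def translit(text):
    """Map each known Cyrillic letter to ASCII; pass everything else through."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = SYSTEM_B_SUBSET.get(lower)
        if latin is None:
            # Not in the subset (spaces, punctuation, Latin text): keep as-is.
            out.append(ch)
        elif ch != lower:
            # Uppercase Cyrillic letter: emit an all-uppercase Latin sequence,
            # following the convention described in Changelog v6.
            out.append(latin.upper())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    # Assumed content of the CYRILLIC line in translit-test-input.txt.
    print(translit("Съешь ещё этих мягких французских булок, да выпей же чаю."))
```

With this subset, the pangram comes out as the expected line quoted
earlier in this email (without the leading "CYRILLIC" label).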