Dear locale maintainers,

To fix glibc bug 2872 "Transliteration Cyrillic -> ASCII fails" https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1], please add the Cyrillic transliteration table file translit_cyrillic https://sourceware.org/bugzilla/attachment.cgi?id=11340 [7] to localedata/locales/ and include it in all your locales going forward. The patch is included inline below.

From this patch I have excluded the locales that already mention Cyrillic or have a transliteration table for it:

  mn_MN sr_RS tg_TJ tk_TM tt_RU uk_UA uz_UZ uz_UZ@cyrillic

Their maintainers are requested to decide explicitly whether and how to include this patch.

Current bug effect:

The glibc wiki explicitly lists this use case as the test example https://sourceware.org/glibc/wiki/Locales#Testing_Locales :

  LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt

Currently it fails on Cyrillic text in most locales, including ru_RU [1] [8] [9]:

  LC_ALL=ru_RU.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt | grep CYRILLIC
  CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

It produces a string of question marks and spaces. This is what it should produce, and what it does produce once the patch is applied:

  CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.

The root problem and the fix:

The root problem is the missing transliteration table, which I am supplying here. In addition, the table has to be included in the active locale at compilation time for iconv to use it.

COMMIT MESSAGE:

This translit_cyrillic table enables conversion (e.g. with iconv) of UTF-8 encoded text written in a Cyrillic alphabet to ASCII//TRANSLIT text. Examples:

  iconv -f UTF-8 -t ASCII//TRANSLIT

will produce an ASCII-compatible transcription, and

  iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv -f ISO-8859-15 -t UTF-8

will produce a Latin transliteration as per ISO 9:1995.

While UTF-encoded Cyrillic text requires Cyrillic fonts, the result of the transliteration/transcription contains only Latin/ASCII characters yet can still be read by a native speaker. Among other things, it is useful for processing Cyrillic texts and filenames with programs or on systems that are not specifically prepared to work with Cyrillic, do not have the corresponding fonts installed, or cannot handle UTF-8.

The transliteration table itself is attached as the file translit_cyrillic [7]. Its content (the mapping) is based on the ISO 9:1995 standard [10] and its derivative GOST 7.79-2000, whose official source is the Federal Agency on Technical Regulating and Metrology of the Russian Federation [2]. Technically, an independent but mostly identical source [3] was used, and the table was prepared in a spreadsheet [6].

The documentation suggests that a transliteration table is included by adding the line

  include "translit_cyrillic";""

to the translit_start section of LC_CTYPE http://man7.org/linux/man-pages/man5/locale.5.html [5]

In practice, I searched for all locales that have a translit_start/translit_end stanza and generated a patch for them.

Cyrillic transliteration of, e.g., Russian text may already have worked to some extent for the mn_MN, sr_RS, tk_TM, uz_UZ and uk_UA locales, which have their transliteration tables included inline. I am excluding these locales from the proposed patch. I have written directly to the locale maintainers at the email addresses listed in the files. Volodymyr Lisivka, Max Kutny (uk_UA) and Данило Шеган (sr_RS) have confirmed the exclusion.

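For illustration only, the selection step described above can be sketched in shell. This is not the script that generated the patch. The function name add_cyrillic_include is hypothetical, and the sketch rests on several assumptions: translit_start sits at the start of a line of its own, the include is placed directly after it, shared translit_* table files are skipped, and any file that mentions "cyrillic" (case-insensitively) is left untouched, mirroring the exclusion rule above.

```shell
#!/bin/sh
# Hypothetical helper (not part of glibc): add the translit_cyrillic include
# to every locale file in a directory that has a translit_start stanza and
# does not already mention Cyrillic.
add_cyrillic_include() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Assumption: shared transliteration tables (translit_*) are not locales.
        case ${f##*/} in
            translit_*) continue ;;
        esac
        # Only files with a translit_start line are candidates.
        grep -q '^translit_start' "$f" || continue
        # Exclusion rule: skip locales that already mention Cyrillic.
        if grep -qi 'cyrillic' "$f"; then
            continue
        fi
        tmp=$(mktemp) || return 1
        # Copy every line; right after translit_start, emit the include line.
        awk '{ print }
             /^translit_start/ { print "include \"translit_cyrillic\";\"\"" }' \
            "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
        echo "patched: $f"
    done
}
```

Run from the top of a glibc source tree, this would be invoked as `add_cyrillic_include localedata/locales`, after which the changes can be reviewed with `git diff` before anything is committed.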
Links:
[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source http://protect.gost.ru/document.aspx?control=7&id=130715 (only available in low-quality GIF format)
[3] http://transliteration.ru/gost-7-79-2000/ and http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic https://sourceware.org/bugzilla/attachment.cgi?id=11301
[7] translit_cyrillic https://sourceware.org/bugzilla/attachment.cgi?id=11340
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A

Best regards,
Egor Kobylkin

---
2018-10-17  Egor Kobylkin

[BZ #2872]
* localedata/locales/translit_cyrillic: Add ISO 9:1995, GOST 7.79 System A transliteration and System B transcription table from Cyrillic to Latin/ASCII.
* localedata/locales/aa_DJ: Add 'include "translit_cyrillic";""' to LC_CTYPE translit section.
* localedata/locales/af_ZA: Likewise.
* localedata/locales/ak_GH: Likewise.
* localedata/locales/am_ET: Likewise.
* localedata/locales/ar_EG: Likewise.
* localedata/locales/az_AZ: Likewise.
* localedata/locales/be_BY: Likewise.
* localedata/locales/bem_ZM: Likewise.
* localedata/locales/ber_DZ: Likewise.
* localedata/locales/ber_MA: Likewise.
* localedata/locales/bg_BG: Likewise.
* localedata/locales/bi_VU: Likewise.
* localedata/locales/bn_BD: Likewise.
* localedata/locales/bo_CN: Likewise.
* localedata/locales/ca_ES: Likewise.
* localedata/locales/ce_RU: Likewise.
* localedata/locales/cmn_TW: Likewise.
* localedata/locales/cs_CZ: Likewise.
* localedata/locales/cv_RU: Likewise.
* localedata/locales/cy_GB: Likewise.
* localedata/locales/da_DK: Likewise.
* localedata/locales/de_DE: Likewise.
* localedata/locales/dv_MV: Likewise.
* localedata/locales/dz_BT: Likewise.
* localedata/locales/el_GR: Likewise.
* localedata/locales/en_GB: Likewise.
* localedata/locales/en_NG: Likewise.
* localedata/locales/en_ZM: Likewise.
* localedata/locales/es_CU: Likewise.
* localedata/locales/es_ES: Likewise.
* localedata/locales/et_EE: Likewise.
* localedata/locales/fa_IR: Likewise.
* localedata/locales/ff_SN: Likewise.
* localedata/locales/fi_FI: Likewise.
* localedata/locales/fr_FR: Likewise.
* localedata/locales/ga_IE: Likewise.
* localedata/locales/gd_GB: Likewise.
* localedata/locales/gu_IN: Likewise.
* localedata/locales/gv_GB: Likewise.
* localedata/locales/he_IL: Likewise.
* localedata/locales/hi_IN: Likewise.
* localedata/locales/hif_FJ: Likewise.
* localedata/locales/hr_HR: Likewise.
* localedata/locales/ht_HT: Likewise.
* localedata/locales/hu_HU: Likewise.
* localedata/locales/hy_AM: Likewise.
* localedata/locales/id_ID: Likewise.
* localedata/locales/is_IS: Likewise.
* localedata/locales/it_IT: Likewise.
* localedata/locales/ja_JP: Likewise.
* localedata/locales/kab_DZ: Likewise.
* localedata/locales/kk_KZ: Likewise.
* localedata/locales/km_KH: Likewise.
* localedata/locales/kn_IN: Likewise.
* localedata/locales/ko_KR: Likewise.
* localedata/locales/ks_IN: Likewise.
* localedata/locales/kw_GB: Likewise.
* localedata/locales/ky_KG: Likewise.
* localedata/locales/lb_LU: Likewise.
* localedata/locales/lg_UG: Likewise.
* localedata/locales/lij_IT: Likewise.
* localedata/locales/ln_CD: Likewise.
* localedata/locales/lo_LA: Likewise.
* localedata/locales/lt_LT: Likewise.
* localedata/locales/lv_LV: Likewise.
* localedata/locales/mg_MG: Likewise.
* localedata/locales/mhr_RU: Likewise.
* localedata/locales/mk_MK: Likewise.
* localedata/locales/ml_IN: Likewise.
* localedata/locales/ms_MY: Likewise.
* localedata/locales/mt_MT: Likewise.
* localedata/locales/nan_TW@latin: Likewise.
* localedata/locales/nb_NO: Likewise.
* localedata/locales/ne_NP: Likewise.
* localedata/locales/nhn_MX: Likewise.
* localedata/locales/niu_NU: Likewise.
* localedata/locales/niu_NZ: Likewise.
* localedata/locales/nl_NL: Likewise.
* localedata/locales/nr_ZA: Likewise.
* localedata/locales/oc_FR: Likewise.
* localedata/locales/om_KE: Likewise.
* localedata/locales/or_IN: Likewise.
* localedata/locales/os_RU: Likewise.
* localedata/locales/pa_IN: Likewise.
* localedata/locales/pa_PK: Likewise.
* localedata/locales/pl_PL: Likewise.
* localedata/locales/pt_PT: Likewise.
* localedata/locales/quz_PE: Likewise.
* localedata/locales/ro_RO: Likewise.
* localedata/locales/ru_RU: Likewise.
* localedata/locales/rw_RW: Likewise.
* localedata/locales/sa_IN: Likewise.
* localedata/locales/sd_IN: Likewise.
* localedata/locales/sd_IN@devanagari: Likewise.
* localedata/locales/se_NO: Likewise.
* localedata/locales/sgs_LT: Likewise.
* localedata/locales/shn_MM: Likewise.
* localedata/locales/si_LK: Likewise.
* localedata/locales/sk_SK: Likewise.
* localedata/locales/sl_SI: Likewise.
* localedata/locales/sm_WS: Likewise.
* localedata/locales/so_SO: Likewise.
* localedata/locales/sq_AL: Likewise.
* localedata/locales/ss_ZA: Likewise.
* localedata/locales/st_ZA: Likewise.
* localedata/locales/sv_SE: Likewise.
* localedata/locales/sw_KE: Likewise.
* localedata/locales/ta_IN: Likewise.
* localedata/locales/te_IN: Likewise.
* localedata/locales/th_TH: Likewise.
* localedata/locales/ti_ET: Likewise.
* localedata/locales/tn_ZA: Likewise.
* localedata/locales/to_TO: Likewise.
* localedata/locales/tpi_PG: Likewise.
* localedata/locales/tr_TR: Likewise.
* localedata/locales/ts_ZA: Likewise.
* localedata/locales/unm_US: Likewise.
* localedata/locales/ur_IN: Likewise.
* localedata/locales/ur_PK: Likewise.
* localedata/locales/ve_ZA: Likewise.
* localedata/locales/vi_VN: Likewise.
* localedata/locales/wa_BE: Likewise.
* localedata/locales/wo_SN: Likewise.
* localedata/locales/xh_ZA: Likewise.
* localedata/locales/yi_US: Likewise.
* localedata/locales/yuw_PG: Likewise.
* localedata/locales/zh_CN: Likewise.
* localedata/locales/zu_ZA: Likewise.