Hi Rafal On 11.10.2018 13:04, Rafal Luzynski wrote: > Thank you, Egor. I am looking at your patch and although I have > not yet finished, here are some remarks: > > First of all, I think that such a large patch should also include > the tests. Please see how automatic tests are performed in locale > data and write your own. Could you please point me to the existing automatic tests? Locally I am using the test suggested in glibc locales wiki. From my commit message: "The glibc wiki explicitly lists this use case as the test example https://sourceware.org/glibc/wiki/Locales#Testing_Locales : LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt " I am visually checking whether any iconv run fails for all those locales but you must refer to some automated unit test with a boolean outcome, right? > > 11.10.2018 00:29 Egor Kobylkin wrote: >> [...] >> From this patch I have excluded locales that already mention cyrillic or >> have a transliteration table for it: >> az_AZ >> iso14651_t1_common >> ky_KG >> mn_MN >> sr_RS >> tg_TJ >> tk_TM >> tt_RU >> uk_UA >> uz_UZ >> uz_UZ@cyrillic >> [...] > > I think that eventually we would like to include your translit_cyrillic > also in these locales because I assume that your rules should work good > for them as well, also should include more characters than the individual > language contributors took into account. Similarly to Mike's work on > collation: a common rules were created and all locales include them adding > their own language specific modifications. This is fine with me. Should anybody supply translit_xxxxxxxxxxxx for any of the mentioned locales we can include them as well. Wouldn't it be easier to coordinate those as separate patches though? > >> [...] >> COMMIT MESSAGE: >> [...] >> I am excluding these locales from this proposed patch. I have written >> directly to locale maintainer emails listed in the files. Volodymyr >> Lisivka , Max Kutny (uk_UA), >> Данило Шеган (sr_YU, sr_CS) have confirmed the > > I am not sure if we want Cyrillic text in the commit message. Shouldn't > it be, uhm, tranlisterated? :-) I do not see any Cyrillic text in the commit message. the ?????? you see are the actual "?" symbols coming out of iconv now. > > "sr_CS" - I guess you meant "sr_RS". > > "sr_YU" has been dropped, do we want to mention it? The list of locales and the patch itself is generated from the actual locales - I do not hand pick them, only exclude the ones in the exclusion list above. > >> [...] >> [BZ #2872] >> * localedata/locales/translit_cyrillic: add ISO 9.1995, GOST 7.79 > > Please start "Add" with an uppercase. BTW, shouldn't it be "New file" > instead? > >> System A transliteration System B transcription table from Cyrillic to >> Latin/ASCII. >> * localedata/locales/C: add include "translit_cyrillic";"" to LC_CTYPE >> translit section. > > Same, "Add" here. > >> * localedata/locales/aa_DJ: Likewise. > > Good (here and everywhere below). > >> [...] >> diff -uNr a/localedata/locales/translit_cyrillic >> b/localedata/locales/translit_cyrillic >> --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000 >> +0000 >> +++ b/localedata/locales/translit_cyrillic 2018-10-09 19:02:54.000000000 >> +0000 >> @@ -0,0 +1,383 @@ >> +escape_char / >> +comment_char % >> + >> +% This file is part of the GNU C Library and contains locale data. >> +% The Free Software Foundation does not claim any copyright interest >> +% in the locale data contained in this file. The foregoing does not >> +% affect the license of the GNU C Library as a whole. It does not >> +% exempt you from the conditions of the license if your use would >> +% otherwise be governed by that license. >> + >> +% Transliterations of cyrillic letters to latin and/or ascii symbols. > > "cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII". > >> +% Inspired by ISO 9.1995 / GOST 7.79-2000. >> +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf >> +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995 > > Typos: > > "i.e" -> "i.e.," (somebody please fix me if I'm wrong here) > "U4001" - I guess you meant "U0401" > "U4F9" -> "U04F9". I think that "U4F9" is not definitely bad but > let's be consistent. These are all good catches. I will fix them and resubmit. > > Also I can see some gaps in the range. Are you going to fill them > or maybe for now just mention that they exist? > No, were not going to fill them please see this: On 10.10.2018 14:34, Marko Myllynen wrote: > On 2018-10-10 15:19, Egor Kobylkin wrote: >> On 10.10.2018 13:22, Marko Myllynen wrote: >>>> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303 >>> Although I haven't checked every rule this in general looks very good >>> (but see below). >>> Not sure do we want to add the few missing characters >>> mentioned at https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode, >>> e.g., one instantly notices that U+0400 is missing. (I wouldn't add at >>> least initially the more exotic characters, like the historic ones, >>> though.) Perhaps filing a bug or two for these cases for separate >>> consideration would be ok. >> The question here is what should serve as their transliteration and >> transcription? > Not sure, so filing a separate bug about this once your patch is merged > might be the most suitable action for now, I don't think we want to > postpone merging your work further due to these non-ISO 9 cases. > >> +% It implements the GOST_7.79 System A (Latin Script) as a first >> +% option and System B Cyrillic (ASCII) as a second option. Check >> +% https://en.wikipedia.org/wiki/ISO_9 for reference. >> +% The System B is extended from GOST_7.79-Russian using open sources >> +% of the transliteration mappings and the "h/`" diacritics logic. > > What is "h/`" diacritics logic? Basically some Linguist mentioned that they have chosen "h" and '`" to represent the diacritics for the transcription (i.e. GOST 7.79 System B). This way there is some resemblance to the watertight transliteration as per ISO 9 (Sysetem A) but it is still all in ASCII. We have decided to extend GOST 7.79 to the all ISO 9 characters and so I have extended it following that Linguist logic. >> +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode. >> +% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872. >> +% Generated from UnicodeData.txt with >> +% https://sourceware.org/bugzilla/attachment.cgi?id=11301. > > 1. Is the file really generated with a script and not modified later? > If yes then maybe you should contribute the script instead? In that case, > you should also not post this file to libc-locale, maintainers and > developers should be able to regenerate it. > 2. The link leads to a LibreOffice spreadsheet. No, I do not have a script. The "generated" means it is a result of formulas in that spreadsheet. People are welcome to write a script that should be straightforward implementation of those rules in formulas. > >> +LC_CTYPE >> + >> +translit_start >> + > > is missing here. Are you going to leave it for now? Yes, it is to be left out, not in ISO 9. See the exchange with Marko above. > >> +% CYRILLIC CAPITAL LETTER IO >> + ;"" >> [...] >> +% CYRILLIC CAPITAL LETTER KJE >> + ;"" > > is missing here. Can we add it already? Yes, it is to be left out, not in ISO 9. See the exchange with Marko above. > >> +% CYRILLIC CAPITAL LETTER SHORT U >> + ;"" >> [...] >> +% CYRILLIC CAPITAL LETTER U >> + >> +% CYRILLIC UNDEFINED >> + ;"" > > This still makes me wonder. > > Does it work at all? > What if we remove this rule, won't it be transliterated as > => "U", - left unchanged, so "U" + " > will eventually produce "Ú"? > Why is it called "UNDEFINED"? On 10.10.2018 14:34, Marko Myllynen wrote: > On 2018-10-10 15:19, Egor Kobylkin wrote: >> On 10.10.2018 13:22, Marko Myllynen wrote: ... >>> I'm not sure this will work, no existing rule in translit_* files >>> contain two characters, I'd assume that the rule for U+0423 is applied >>> first and then the below rule is never used. >>> >>> % CYRILLIC UNDEFINED >>> ;"" >>> >>> Perhaps this should be commented out or removed altogether if it's not >>> working as intended. >> >> So yes, they are not processed. I would drop them to not to have special >> cases. But I am also fine with keeping them because all work is done >> already. > I'd probably drop them but I don't feel strongly about this either way. > > Thanks for your efforts, I don't have any further comments, I'll leave > this now for Rafal and Mike to provide additional feedback and hopefully > merge soon. Could you also please check the discussion with Marko on UNDEFINED and other related topics? You were on To: or CC: for those emails. The same for the other characters below. > Do we need similar rules for other characters? > >> [...] >> +% CYRILLIC SMALL LETTER U >> + >> +% CYRILLIC UNDEFINED >> + ;"" > > Same here. > >> [...] >> +% CYRILLIC SMALL LETTER YA >> + ;"" > > Again missing (because it is lowercase variant of ). > >> +% CYRILLIC SMALL LETTER IO >> + ;"" >> [...] >> +% CYRILLIC SMALL LETTER KJE >> + ;"" > > missing (same reason as ). > >> +% CYRILLIC SMALL LETTER SHORT U >> + ;"" >> +% CYRILLIC SMALL LETTER DZHE >> + "";"" > > More letters missing here. Is this because they are historic so we > don't want to include them now? Well, but "YUS" is also historic. > (Please, do not remove YUS for consistency). > >> +% CYRILLIC CAPITAL LETTER BIG YUS >> + ;"" >> +% CYRILLIC SMALL LETTER BIG YUS >> + ;"" >> [...] > > I will continue but, again, I don't give any ETA so other reviewers > are welcome here. > > Regards, > > Rafal > Bests, Egor