Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29

From: Egor Kobylkin <egor@kobylkin.com>
To: Marko Myllynen <myllynen@redhat.com>,
	Rafal Luzynski <digitalfreak@lingonborough.com>,
	Keld Simonsen <keld@keldix.com>
Cc: libc-alpha@sourceware.org, libc-locales@sourceware.org,
	"Dmitry V. Levin" <ldv@altlinux.org>,
	Volodymyr Lisivka <vlisivka@gmail.com>,
	Carlos O'Donell <carlos@redhat.com>, Max Kutny <mkutny@gmail.com>,
	danilo@gnome.org
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
Date: Fri, 5 Oct 2018 22:47:09 +0200
Message-ID: <b8f02fe9-f911-487f-b50b-9b0c43191cb6@kobylkin.com>
In-Reply-To: <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>

After some kind help from Marko in an offline discussion,
I realized that the multi/single character approach I originally took
went against the grain of the iconv(1) logic anyway. So there is no harm
in dropping it and adopting Marko's suggestion instead. I will do so and
resubmit the patch with ISO 9:1995/GOST 7.79 System A plus a fallback to
GOST 7.79 System B (for ASCII).

However, this doesn't resolve the issue of the ASCII part being different
for various locales. Again, I invite the locale maintainers to let
me know whether they want to 1) adopt the one I am supplying, 2) write
their own, or 3) ignore the patch altogether. Your feedback is appreciated!

This is the relevant part that helped:
> The first part (ISO-8859-15 or ASCII) defines the target encoding for
> iconv(1). //TRANSLIT is described in the iconv(1) man page as:
> 
> If the string //TRANSLIT is appended to to-encoding, characters
> being converted are transliterated when needed and possible. This
> means that when a character cannot be represented in the target
> character set, it can be approximated through one or several
> similar looking characters. Characters that are outside of the
> target character set and cannot be transliterated are replaced
> with a question mark (?) in the output.
> 
> So in the above examples, iconv(1) encounters the character U+0428,
> which is not part of either target encoding, and since
> //TRANSLIT is specified, iconv(1) tries transliteration according to
> the rules defined above. In the case of ASCII, U+0160 is not part of
> the target encoding either, so the next alternative is used.
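
The selection logic described in the quoted text can be sketched roughly
as follows. This is only a minimal Python illustration under my own
assumptions, not glibc's implementation: the names parse_entry,
can_encode and transliterate are hypothetical, and the table holds just
the single U+0428 entry from Marko's example.

```python
import re

# Matches one <Uxxxx> code point token as used in the translit_* files.
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def parse_entry(line):
    """Split an entry such as '<U0428> "<U0160>";"<U0053><U0068>"'
    into (source character, list of alternatives in priority order)."""
    source, _, rest = line.strip().partition(" ")
    src_char = chr(int(CODE_POINT.fullmatch(source).group(1), 16))
    alternatives = []
    for alt in rest.strip().split(";"):
        code_points = CODE_POINT.findall(alt)
        alternatives.append("".join(chr(int(cp, 16)) for cp in code_points))
    return src_char, alternatives


def can_encode(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, table, encoding):
    """Keep representable characters, otherwise use the first alternative
    that fits the target encoding, otherwise emit a question mark."""
    out = []
    for ch in text:
        if can_encode(ch, encoding):
            out.append(ch)
            continue
        for alt in table.get(ch, []):
            if can_encode(alt, encoding):
                out.append(alt)
                break
        else:
            out.append("?")  # no usable alternative, as the man page describes
    return "".join(out)


# Illustrative table with the one entry quoted in this thread.
src, alts = parse_entry('<U0428> "<U0160>";"<U0053><U0068>"')
table = {src: alts}

print(transliterate("\u0428", table, "iso8859-15"))  # System A form
print(transliterate("\u0428", table, "ascii"))       # System B fallback
```

With this single entry, the ISO-8859-15 target gets the System A form
(U+0160) and the ASCII target falls through to "Sh", which matches the
behaviour Marko describes for the two iconv(1) pipelines.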

Bests,
Egor Kobylkin

On 05.10.2018 14:21, Marko Myllynen wrote:
> Hi,
> 
> The scheme I proposed would also be ASCII compatible; consider this 
> example:
> 
> % CYRILLIC CAPITAL LETTER SHA
> <U0428> "<U0160>";"<U0053><U0068>"
> 
> "printf \\u0428\\n | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv 
> -f ISO-8859-15 -t UTF-8" would produce Š as per System A and "printf
>  \\u0428\\n | iconv -f UTF-8 -t ASCII//TRANSLIT" would produce Sh as 
> per System B.
> 
> Thanks,
> 
> On 2018-10-05 15:00, Egor Kobylkin wrote:
>> Hi Marko,
>> 
>> I have chosen System B because it is ASCII compatible. System
>> A is not ASCII compatible (diacritics in target).
>> 
>> https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
>> 
>> "GOST 7.79 contains two transliteration tables.
>> 
>> System A: one Cyrillic character to one Latin character, some with
>> diacritics – identical to ISO 9:1995
>> 
>> System B: one Cyrillic character to one or many Latin characters
>> without diacritics"
>> 
>> Hope this helps,
>> Egor
>> 
>> On 05.10.2018 13:54, Marko Myllynen wrote:
>>> Hi,
>>> 
>>> Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
>>> possible and if not, then fall back to GOST 7.79 System B?
>>> 
>>> Implementation-wise, the current translit_* files have a few
>>> examples where a non-ASCII transliteration is tried first before an
>>> ASCII fallback. These examples are from translit_neutral:
>>> 
>>> % NARROW NO-BREAK SPACE
>>> <U202F> <U00A0>;<U0020>
>>> % REVERSED TRIPLE PRIME
>>> <U2037> "<U2035><U2035><U2035>";"<U0060><U0060><U0060>"
>>> 
>>> Thanks,
>>> 
>>> On 2018-10-05 13:29, Egor Kobylkin wrote:
>>>> Keld, Marko, Rafal, other locale maintainers,
>>>> 
>>>> All of this is written with a minimal viable fix for this bug
>>>> in mind, asap. I want to avoid wasting the maintainers' time by
>>>> getting into fundamental discussions here (although they would
>>>> be held for perfectly good reasons).
>>>> 
>>>> I see three options:
>>>> 
>>>> 1. Those locale maintainers that are fine with using the ISO
>>>>    9:1995/GOST_7.79_System_B Cyrillic transliteration table (Ru)
>>>>    include it in their locales (see the attached screenshot of
>>>>    the table).
>>>> 2. Those that want to have a differing table can create their
>>>>    own variety based on the spreadsheet I have prepared
>>>>    https://sourceware.org/bugzilla/attachment.cgi?id=8590 and
>>>>    include it in this patch.
>>>> 3. Those that want to omit a Cyrillic transliteration altogether
>>>>    for now state so and just carry over bug #2872 from the
>>>>    year 2006.
>>>> 
>>>> Does this make sense to you?
>>>> 
>>>> Just to be super clear on this: the patch is a stopgap _ASCII_
>>>> transliteration table. ASCII being the AMERICAN Standard Code for
>>>> Information Interchange, it is obviously orthogonal to the
>>>> transliteration rules of other countries. As such it is not
>>>> explicitly targeting the transliteration standards of any country.
>>>> 
>>>> The patch reflects the Russian variety of ISO
>>>> 9:1995/GOST_7.79_System_B because a) ISO
>>>> 9:1995/GOST_7.79_System_B is available and can be helpful to a
>>>> majority of Cyrillic users, and b) I have access to it, including
>>>> by being proficient in Russian.
>>>> 
>>>> It is offered to all the respective locale maintainers as a 
>>>> stopgap solution. Stopgap in the sense that it is better to 
>>>> have some transliteration than not to have any at all and
>>>> carry over the bug from 2006. That it may be a somewhat
>>>> officially correct transliteration for ru_RU is a bonus. In
>>>> that sense I would dub the discussion on the correctness for
>>>> other languages "offtopic". Let me know if this is not OK.
>>>> 
>>>> You are all correctly pointing out the deficiencies of this
>>>> approach. However, I couldn't find a better straightforward
>>>> approach as of yet. I am happy to hear from you on how this
>>>> could be handled.
>>>> 
>>>> There is a danger of being caught in the web of 
>>>> language/country differences. I propose just pruning the 
>>>> locales that are not comfortable including this current table. 
>>>> We can address possible solutions in the second wave of 
>>>> patching.
>>>> 
>>>> I am wary of getting into discussions on specific country
>>>> variants just because of the sheer complexity of this topic.
>>>> It is probably better addressed by the respective maintainers of
>>>> their locales. I do not see a "one size fits all" solution as
>>>> possible in this first wave.
>>>> 
>>>> I would like to have this "three options plan of action"
>>>> vetted first and then we could go into the specific details.
>>>> (Like, for instance, what characters should be included in
>>>> the table, and in which transliteration form.)
>>>> 
>>>> I am looking forward to your reply,
>>>> Egor Kobylkin
>>>> 
>>>> P.S. Specifically as to how to address languages other than Ru
>>>> included in GOST_7.79_System_B: we can take the first option
>>>> left to right from that table (Ru,By,Uk,Bg,Mk). Then it will
>>>> technically work for all those locales/languages but with
>>>> errors where Ru supersedes their own variants.
>>>> 
>>>> 
>>>> On 05.10.2018 11:20, Rafal Luzynski wrote:
>>>>> 3.10.2018 11:32 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>>> 
>>>>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>>>>> Hi
>>>>>>> 
>>>>>>> Please note that transliteration of Cyrillic to Latin
>>>>>>> is not universal. There are different schemes for, for
>>>>>>> example, German, English and Danish, and there is also an
>>>>>>> ISO standard for it.
>>>>>> 
>>>>>> Thanks for your feedback, Keld!
>>>>>> 
>>>>>> Could the locale maintainers that wouldn't like to include 
>>>>>> this patch explicitly state so here?
>>>>> 
>>>>> I think it is about me so I must reply.  I am sorry about
>>>>> that and the sole reason is my lack of time.  I'm just a
>>>>> volunteer here, that means it's not my regular job to work
>>>>> on locale data nor anything in glibc nor in any other open
>>>>> source project.  I do these things only in my free time,
>>>>> of which I don't have much.  Of course you will see my
>>>>> contributions here and there but they are either trivial or
>>>>> take me months to complete.  Your patches are on my radar but
>>>>> I can't give any ETA for them.  Of course, there are other
>>>>> people around here and they are all welcome to come and
>>>>> join.
>>>>> 
>>>>>> That is:
>>>>>> 
>>>>>> - In the case that there is a different preferred Cyrillic
>>>>>>   transliteration table for any specific locale, its
>>>>>>   maintainers may want to point me to it so I can supply a
>>>>>>   separate table/patch.
>>>>>> - Or they could state explicitly that for some reason they
>>>>>>   would like to exclude their locale from the patch for a
>>>>>>   default Cyrillic transliteration altogether.
>>>>> 
>>>>> As Keld wrote, there are probably separate rules for every
>>>>> language so I don't think you should treat your rules as
>>>>> universal and include them in every locale.  At first sight,
>>>>> it seems to me they work only for English (as a destination
>>>>> locale).  Also, although it is called "transliteration from
>>>>> Cyrillic" it seems that it covers only the Russian alphabet.
>>>>> What about other languages which use the Cyrillic alphabet but
>>>>> add their own diacritic characters?  Think about Belarusian,
>>>>> Ukrainian, Serbian, Chechen, Chuvash, Mari, Ossetian, Yakut,
>>>>> Tatar, and more.  What about languages which use the Cyrillic
>>>>> alphabet but transliterate their respective letters in a
>>>>> different way than Russian?  For example, Russian "Ъ" is (I
>>>>> think) usually skipped in transliteration, and I think you
>>>>> propose "``", but when transliterating from Bulgarian it is
>>>>> usually transliterated as "ă".
>>>>> 
>>>>> A few remarks:
>>>>> 
>>>>> * I think you transliterate "щ" as "shh", wouldn't "shch" be
>>>>>   better?
>>>>> * You transliterate "ц" as "cz", wouldn't "ts" be better?  By
>>>>>   the way, in the Polish language "cz" is a correct
>>>>>   transliteration of "ч".
>>>>> * You transliterate "й" as "j", this is fine in many languages
>>>>>   but wouldn't "y" be better in English?
>>>>> * In case of "е": how will you know if it is correct to
>>>>>   transliterate it to "e" or "ie" or "je" or "ye"?
>>>>> 
>>>>> These remarks are obviously incomplete; your patch deserves
>>>>> a much more attentive review.
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Rafal
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
>
