unofficial mirror of libc-alpha@sourceware.org
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Egor Kobylkin <egor@kobylkin.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org,
	mfabian@redhat.com, Marko Myllynen <myllynen@redhat.com>,
	"Dmitry V. Levin" <ldv@altlinux.org>
Cc: Volodymyr Lisivka <vlisivka@gmail.com>,
	Max Kutny <mkutny@gmail.com>,
	danilo@gnome.org
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
Date: Sat, 13 Oct 2018 02:59:17 +0200 (CEST)
Message-ID: <165238610.582597.1539392357757@poczta.nazwa.pl>
In-Reply-To: <d5582688-819b-90c2-3f4a-0d19c932d487@kobylkin.com>

Egor,

Thank you for the update.  I took a closer look at your patch, so this
time my review is more thorough than before, although still not complete.

As far as I understand, ISO-9 and its GOST variants are meant to be
universal rather than Russian-specific.  Therefore it is correct to place
them in a separate file, like translit_cyrillic, and then include this
file in other locales, adding locale-specific modifications if required.
For example, if there are any Russian-specific rules not included in this
file, they should go to ru_RU.

The text of the ISO-9 standard is not publicly available.  Do we have
anything better than the Wikipedia article?

Regarding the format of your commit message, I hesitate to say anything
more because there are more experienced maintainers around here.  Please
take a look at the Contribution Checklist. [1]

While we are at it, what is your legal relationship with the glibc project?
Have you signed the FSF Copyright Assignment?  It is not necessary for the
locale data but it might be necessary if you are going to contribute test code.

Regarding the tests, I think there is no complete transliteration test
suite at the moment.  Probably the only test is localedata/bug-iconv-trans.c.
You can also look at the collation tests in the same directory; they
use the multiple *.UTF-8.in files.

You can skip the tests for now.

Technical issue:  Please either attach your patch to the email message or
paste it inline, not both.  The patch as it is now does not apply.
I had to edit it manually to apply it.


12.10.2018 16:05 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> From this patch I have excluded locales that already mention cyrillic or
> have a transliteration table for it:
> az_AZ
> iso14651_t1_common
> ky_KG
> mn_MN
> sr_RS
> tg_TJ
> tk_TM
> tt_RU
> uk_UA
> uz_UZ
> uz_UZ@cyrillic

I confirm that these locales are excluded and there are no other missing
locales.

> [...]
>
> diff -uNr a/localedata/locales/C b/localedata/locales/C
> --- a/localedata/locales/C 2018-10-11 15:10:12.000000000 +0000
> +++ b/localedata/locales/C 2018-10-11 15:10:43.000000000 +0000

There is no such file.  Where did you get the source code from?  Are you
sure this is glibc? :-)

> [...]
> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
> @@ -1394,6 +1394,7 @@
> <U137A> <U0060><U0039><U0030>
> <U137B> <U0060><U0031><U0030><U0030>
> <U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
> +include "translit_cyrillic";""
> translit_end
> %
> END LC_CTYPE

Shouldn't “include "translit_cyrillic";""” be placed before the custom rules,
together with the other includes?  The same applies to more files; I will
not list them all.

> [...]
> diff -uNr a/localedata/locales/sd_IN@devanagari
> b/localedata/locales/sd_IN@devanagari
> --- a/localedata/locales/sd_IN@devanagari 2018-10-11 15:10:18.000000000
> +0000
> +++ b/localedata/locales/sd_IN@devanagari 2018-10-11 15:10:49.000000000
> +0000

These 3 lines have been wrapped by the email client, so the patch does not apply.
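
A rough way to catch this kind of damage before sending is sketched below.
It is only an illustration: the function name, the prefix list and the
heuristic itself are my assumptions, not part of any glibc tooling.  It
flags every line between a "diff" line and the first hunk that does not
look like a diff header line.

```python
# Hypothetical helper: prefixes that a header line of a unified diff is
# expected to start with (an assumed, non-exhaustive list).
HEADER_PREFIXES = ("diff ", "--- ", "+++ ", "index ",
                   "new file mode", "deleted file mode")


def wrapped_header_lines(patch_text):
    """Return (line number, text) pairs for suspicious diff header lines."""
    suspicious = []
    in_header = False
    for number, line in enumerate(patch_text.splitlines(), start=1):
        if line.startswith("diff "):
            in_header = True            # a new file header begins
        elif line.startswith("@@"):
            in_header = False           # the first hunk ends the header
        elif in_header and not line.startswith(HEADER_PREFIXES):
            suspicious.append((number, line))
    return suspicious


# The header quoted above, exactly as it arrived in the email.
sample = "\n".join([
    "diff -uNr a/localedata/locales/sd_IN@devanagari",
    "b/localedata/locales/sd_IN@devanagari",
    "--- a/localedata/locales/sd_IN@devanagari 2018-10-11 15:10:18.000000000",
    "+0000",
    "+++ b/localedata/locales/sd_IN@devanagari 2018-10-11 15:10:49.000000000",
    "+0000",
])

for number, line in wrapped_header_lines(sample):
    print(number, repr(line))
```

For the sample it reports lines 2, 4 and 6, which are the three
continuation lines produced by the wrapping.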

> [...]
> diff -uNr a/localedata/locales/sd_PK b/localedata/locales/sd_PK
> --- a/localedata/locales/sd_PK 2018-10-11 15:10:18.000000000 +0000
> +++ b/localedata/locales/sd_PK 2018-10-11 15:10:49.000000000 +0000

There is no such file in glibc.

> [...]
> diff -uNr a/localedata/locales/translit_cyrillic
> b/localedata/locales/translit_cyrillic
> --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
> +0000
> +++ b/localedata/locales/translit_cyrillic 2018-10-11 15:10:52.000000000
> +0000

Again 3 lines are wrapped, so the patch does not apply.

> [...]
> +% Contributions welcome for the rest of Cyrillic script in Unicode
> +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.

I am still tempted to add more Cyrillic characters but I understand
that we must clearly separate the transliteration rules that come from
ISO-9 from those that are our own invention.  But that's not for now.

> [...]
> +translit_start
> +
> +% CYRILLIC CAPITAL LETTER IO
> +<U0401> <U00CB>;"<U0059><U004F>"

This says that for ASCII (GOST 7.79 System B) you would like to transliterate
"Ё" as "YO" but the table in Wikipedia says "Yo".  I understand that either
may be correct depending on the context but we should be consistent,
and it is better to stick with the standard.
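
To make rules like this easier to read, here is a minimal sketch that
decodes the <Uxxxx> notation and flags an all-uppercase multi-letter
alternative.  The function names are hypothetical, and the check only
encodes the casing concern raised here; it does not consult the standard.

```python
import re


def decode_rule(line):
    """Split a translit rule into its source text and ordered alternatives."""
    source, alternatives = line.split(None, 1)

    def to_text(token):
        # Drop the surrounding quotes, then replace each <Uxxxx> by its character.
        return re.sub(r"<U([0-9A-Fa-f]{4,8})>",
                      lambda match: chr(int(match.group(1), 16)),
                      token.strip().strip('"'))

    return to_text(source), [to_text(alt) for alt in alternatives.split(";")]


def suggest_title_case(alternative):
    """Return a title-cased form for an all-caps multi-letter alternative."""
    letters = [char for char in alternative if char.isalpha()]
    if len(letters) > 1 and alternative.isupper():
        return alternative[0] + alternative[1:].lower()
    return None  # single letters and mixed case are left alone


source, alternatives = decode_rule('<U0401> <U00CB>;"<U0059><U004F>"')
print(source, alternatives)                  # Ё ['Ë', 'YO']
print(suggest_title_case(alternatives[-1]))  # Yo
```

The same check leaves "Ye", "Yi" and "U`" untouched, because they are
either already mixed case or contain a single letter.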

> +% CYRILLIC CAPITAL LETTER DJE
> +<U0402> <U0110>;"<U0044><U004A>"

This says "DJ" but System B does not mention it.  Where does it come from?
Also, I think it should be "Dj" rather than "DJ".

> +% CYRILLIC CAPITAL LETTER GJE
> +<U0403> <U01F4>;"<U0047><U0060>"

Correct, according to both systems.

> +% CYRILLIC CAPITAL LETTER UKRAINIAN IE
> +<U0404> <U00CA>;"<U0059><U0065>"

"Ye" - correct.

> +% CYRILLIC CAPITAL LETTER DZE
> +<U0405> <U1E90>;"<U005A><U0060>"

Correct.

> +% CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
> +<U0406> <U00CC>;<U0049>

Correct.  The table mentions an alternative transliteration "I`" but
says that it is "only before vowels for Old Russian and Old Bulgarian".
I think we can skip this other variant.

> +% CYRILLIC CAPITAL LETTER YI
> +<U0407> <U00CF>;"<U0059><U0069>"

"Yi" - correct.

> +% CYRILLIC CAPITAL LETTER JE
> +<U0408> "<U004A><U030C>";<U004A>

Correct.

> +% CYRILLIC CAPITAL LETTER LJE
> +<U0409> "<U004C><U0302>";"<U004C><U0060>"

Correct, according to the standard.  If the Serbian language requires "Lj"
then the override should go into the sr_RS file.

> +% CYRILLIC CAPITAL LETTER NJE
> +<U040A> "<U004E><U0302>";"<U004E><U0060>"

Correct, the same comment.

> +% CYRILLIC CAPITAL LETTER TSHE
> +<U040B> <U0106>;"<U0054><U0053><U0048>"

Where does "TSH" come from?  It is not mentioned by the System B table.
Also I am afraid this is not correct.

> +% CYRILLIC CAPITAL LETTER KJE
> +<U040C> <U1E30>;"<U004B><U0060>"

Correct.

> +% CYRILLIC CAPITAL LETTER SHORT U
> +<U040E> <U016C>;"<U0055><U0060>"

"U`" - correct.

> +% CYRILLIC CAPITAL LETTER DZHE
> +<U040F> "<U0044><U0302>";"<U0044><U0068>"

"Dh" - correct.

> [...]
> +% CYRILLIC CAPITAL LETTER ZHE
> +<U0416> <U017D>;"<U005A><U0048>"

"ZH" - shouldn't be "Zh"?

> [...]
> +% CYRILLIC UNDEFINED
> +<U0423><U0301> <U00DA>;"<U0055><U0060>"

1. I think it should be named "CYRILLIC CAPITAL LETTER U WITH ACUTE".
2. OK, the System A table mentions this letter but System B does not.
   Somehow we should handle it.  I think that "U`" is the best we can
   do for now.
3. It must be tested whether this actually works.

> [...]
> +% CYRILLIC CAPITAL LETTER HA
> +<U0425> <U0048>;<U0058>

"H" is available in every encoding, so this letter will always be
transliterated as "H" and never as "X".  We can't help it and
I don't think it is bad.
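
The behaviour I am describing can be modelled as follows.  This is only a
sketch of the selection order as I understand it (the first alternative that
the target charset can represent wins); the function name is hypothetical
and this is not how glibc implements it internally.

```python
def first_encodable(alternatives, charset):
    """Return the first alternative that can be encoded in the target charset."""
    for candidate in alternatives:
        try:
            candidate.encode(charset)
        except UnicodeEncodeError:
            continue  # not representable, fall through to the next alternative
        return candidate
    return None


# <U0425> <U0048>;<U0058>: "H" is plain ASCII, so "X" is never reached.
print(first_encodable(["H", "X"], "ascii"))          # H

# <U0401> <U00CB>;"<U0059><U004F>": the choice depends on the target.
print(first_encodable(["\u00cb", "YO"], "ascii"))    # YO
print(first_encodable(["\u00cb", "YO"], "latin-1"))  # Ë
```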

> +% CYRILLIC CAPITAL LETTER TSE
> +<U0426> <U0043>;"<U0043><U005A>"

1. "CZ" - maybe should be "Cz"?
2. Are we able to implement the rule: "c before i, e, y, j"?
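
For reference, this is what the context rule would mean if it were
implemented outside the locale table.  It is a sketch under assumptions:
the table is a tiny lowercase-only excerpt that I filled in with System B
values, and I read the rule as "the transliteration of the next letter
starts with i, e, y or j".  Whether a translit table can express this at
all is exactly the open question.

```python
# Partial, lowercase-only excerpt for illustration (assumed System B values).
BASE = {
    "а": "a", "е": "e", "и": "i", "й": "j", "к": "k",
    "р": "r", "ы": "y`", "ь": "`", "ц": "cz",
}
SOFT_CONTEXT = ("i", "e", "y", "j")


def transliterate(text):
    """Transliterate text, choosing "c" or "cz" for ц from the next letter."""
    out = []
    for index, char in enumerate(text):
        if char == "ц":
            next_char = text[index + 1] if index + 1 < len(text) else ""
            following = BASE.get(next_char, "")
            # "c" before i, e, y, j; "cz" everywhere else, including word end.
            out.append("c" if following.startswith(SOFT_CONTEXT) else "cz")
        else:
            out.append(BASE.get(char, char))  # unknown characters pass through
    return "".join(out)


print(transliterate("цирк"))  # cirk
print(transliterate("царь"))  # czar`
```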

> +% CYRILLIC CAPITAL LETTER CHE
> +<U0427> <U010C>;"<U0043><U0048>"

"CH" -> "Ch"?

> +% CYRILLIC CAPITAL LETTER SHA
> +<U0428> <U0160>;"<U0053><U0048>"

"SH" -> "Sh"?

> +% CYRILLIC CAPITAL LETTER SHCHA
> +<U0429> <U015C>;"<U0053><U0048><U0048>"

"SHH" -> "Shh"?

> +% CYRILLIC CAPITAL LETTER HARD SIGN
> +<U042A> <U02BA>;"<U0041><U0060>"

"A`" is only for Bulgarian and should go to bg_BG.  How should
we transliterate an upper case hard sign to plain ASCII?  I think
that just "``", same as lower case.

> +% CYRILLIC CAPITAL LETTER YERU
> +<U042B> <U0059>;"<U0059><U0060>"

Again, as "Y" is always available it will never be transliterated
as "Y`".

> +% CYRILLIC CAPITAL LETTER SOFT SIGN
> +<U042C> <U02B9>;<U0060>

OK, I like it to be transliterated to plain ASCII as "`".

> +% CYRILLIC CAPITAL LETTER E
> +<U042D> <U00C8>;"<U0045><U0060>"

OK

> +% CYRILLIC CAPITAL LETTER YU
> +<U042E> <U00DB>;"<U0059><U0055>"

"YU" -> "Yu"?

> +% CYRILLIC CAPITAL LETTER YA
> +<U042F> <U00C2>;"<U0059><U0041>"

"YA" -> "Ya"?

> [...]

I am sorry, this is of course incomplete but that's enough for tonight.

Regards,

Rafal


[1] https://sourceware.org/glibc/wiki/Contribution%20checklist
