From: Egor Kobylkin <egor@kobylkin.com>
To: Marko Myllynen <myllynen@redhat.com>,
Rafal Luzynski <digitalfreak@lingonborough.com>,
libc-alpha@sourceware.org, libc-locales@sourceware.org
Cc: mfabian@redhat.com, "Dmitry V. Levin" <ldv@altlinux.org>,
Volodymyr Lisivka <vlisivka@gmail.com>,
Max Kutny <mkutny@gmail.com>,
danilo@gnome.org
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
Date: Mon, 15 Oct 2018 13:54:53 +0200
Message-ID: <8206061b-4e9c-b366-85a4-93ef61687ca0@kobylkin.com>
In-Reply-To: <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 2336 bytes --]
On 15.10.2018 13:04, Marko Myllynen wrote:
> Hi,
>
> On 2018-10-13 19:58, Egor Kobylkin wrote:
>> On 13.10.2018 02:59, Rafal Luzynski wrote:
>>
>>> Regarding the tests, I think there is no complete transliteration
>>> test suite at the moment. Probably the only test is
>>> localedata/bug-iconv-trans.c. You can also see the collation tests
>>> placed in the same directory, they use those multiple *.UTF-8.in
>>> files.
>>>
>>> You can skip the tests for now.
>>
>> First I thought they could just be added, but not all locales
>> transliterate umlauts, so just extending the current test won't do as it
>> will fail for those locales.
>
> I still think a one-time check against uconv(1) (part of Unicode's ICU
> project) for discrepancies would be useful.
Just an addition. I have changed a few constants to see whether
localedata/bug-iconv-trans.c could be made to test Cyrillic. Attached is
bug-iconv-trans-cyr.c, which passes in this form. I had to save it as
UTF-8 instead of the ISO-8859-15 used for localedata/bug-iconv-trans.c.
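For the one-time comparison against uconv(1) suggested above, the comparison step could be sketched roughly as follows. This is only an illustration: it assumes the per-character outputs of iconv and of the reference tool have already been collected into arrays, and the name count_discrepancies is hypothetical, not part of glibc or ICU.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compare N transliteration results pairwise.
   INPUTS[i] is the source character (UTF-8), OURS[i] is what iconv
   produced for it and REFERENCE[i] is what the reference tool produced.
   Each mismatch is printed to REPORT (if not NULL).  Returns the number
   of mismatches.  */
static size_t
count_discrepancies (const char *const *inputs, const char *const *ours,
                     const char *const *reference, size_t n, FILE *report)
{
  size_t mismatches = 0;

  assert (inputs != NULL && ours != NULL && reference != NULL);

  for (size_t i = 0; i < n; i++)
    if (strcmp (ours[i], reference[i]) != 0)
      {
        if (report != NULL)
          fprintf (report, "%s: iconv gives \"%s\", reference gives \"%s\"\n",
                   inputs[i], ours[i], reference[i]);
        mismatches++;
      }

  return mismatches;
}
```

A return value of zero would mean the two tables agree on every character that was checked; any other value points at entries worth a manual look.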
>>>> [...]
>>>> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
>>>> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
>>>> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
>>>> @@ -1394,6 +1394,7 @@
>>>>  <U137A> <U0060><U0039><U0030>
>>>>  <U137B> <U0060><U0031><U0030><U0030>
>>>>  <U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
>>>> +include "translit_cyrillic";""
>>>>  translit_end
>>>>  % END LC_CTYPE
>>>
>>> Shouldn't “include "translit_cyrillic";""” be placed before the
>>> custom rules, together with other includes? The same in more files,
>>> I will not mention them all.
>>
>> If I recall correctly it is because of the
>> "translit_end
>> END LC_CTYPE"
>> part at the end of the translit_cyrillic. This way it works for any
>> locale, regardless of whether it has translit itself or not. And being at
>> the end it does not supersede any previous transliteration that may be
>> there for a reason.
>
> I suspect one problem would be that the latter rule wins, so if there
> are some locale-specific rules, then possible translit_* inclusions would
> override them if not included before the locale-specific rules.
What is the best way forward here? Can somebody make an explicit
suggestion on how to change the current approach if needed?
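One way to settle which rule actually wins is to check empirically: pick a character covered both by a locale's own rules and by the included table, transliterate it under that locale, and see which output appears. A minimal sketch, assuming glibc's iconv with the //TRANSLIT suffix; the helper name translit_ascii is hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: transliterate the UTF-8 string IN to ASCII using
   the transliteration rules of the locale LOCALE_NAME.  On success the
   NUL-terminated result is stored in OUT and 0 is returned; on any
   failure -1 is returned.  */
static int
translit_ascii (const char *locale_name, const char *in,
                char *out, size_t outsize)
{
  assert (locale_name != NULL && in != NULL && out != NULL && outsize > 0);

  /* The transliteration table is taken from the current LC_CTYPE.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char *inptr = (char *) in;
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft = outsize - 1;	/* Keep one byte for the final NUL.  */

  size_t n = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);

  if (n == (size_t) -1 || inleft != 0)
    return -1;

  *outptr = '\0';
  return 0;
}
```

Calling this once with the include placed before the locale-specific rules and once with it placed after them (rebuilding the locale in between) should show whether the order changes the result for that character.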
Bests,
Egor
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: bug-iconv-trans-cyr.c --]
[-- Type: text/x-csrc; name="bug-iconv-trans-cyr.c", Size: 2207 bytes --]
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
int
main (void)
{
iconv_t cd;
const char str[] = "CyrillicLetters_ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’";
const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
  char *inptr = (char *) str;
  size_t inlen = strlen (str) + 1;
  char outbuf[500];
  char *outptr = outbuf;
  size_t outlen = sizeof (outbuf);
  int result = 0;
  size_t n;

  /* The transliteration rules come from the LC_CTYPE of this locale.  */
  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
    {
      puts ("setlocale failed");
      return 1;
    }

  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      puts ("iconv_open failed");
      return 1;
    }

  /* iconv returns the number of irreversible conversions performed.  */
  n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  if (n != 174)
    {
      if (n == (size_t) -1)
        printf ("iconv() returned error: %m\n");
      else
        printf ("iconv() returned %zu, expected 174\n", n);
      result = 1;
    }

  if (inlen != 0)
    {
      puts ("not all input consumed");
      result = 1;
    }
  else if ((size_t) (inptr - str) != strlen (str) + 1)
    {
      printf ("inptr wrong, advanced by %td\n", inptr - str);
      result = 1;
    }

  if (memcmp (outbuf, expected, sizeof (expected)) != 0)
    {
      printf ("result wrong: \"%.*s\", expected: \"%s\"\n",
              (int) (sizeof (outbuf) - outlen), outbuf, expected);
      result = 1;
    }
  else if (outlen != sizeof (outbuf) - sizeof (expected))
    {
      printf ("outlen wrong: %zu, expected %zu\n", outlen,
              sizeof (outbuf) - sizeof (expected));
      result = 1;
    }
  else
    printf ("output is \"%s\" which is OK\n", outbuf);

  iconv_close (cd);
  return result;
}