unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Egor Kobylkin <egor@kobylkin.com>
To: Marko Myllynen <myllynen@redhat.com>,
	Rafal Luzynski <digitalfreak@lingonborough.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org
Cc: mfabian@redhat.com, "Dmitry V. Levin" <ldv@altlinux.org>,
	Volodymyr Lisivka <vlisivka@gmail.com>,
	Max Kutny <mkutny@gmail.com>,
	danilo@gnome.org
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
Date: Mon, 15 Oct 2018 13:54:53 +0200	[thread overview]
Message-ID: <8206061b-4e9c-b366-85a4-93ef61687ca0@kobylkin.com> (raw)
In-Reply-To: <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 2336 bytes --]

On 15.10.2018 13:04, Marko Myllynen wrote:
> Hi,
> 
> On 2018-10-13 19:58, Egor Kobylkin wrote:
>> On 13.10.2018 02:59, Rafal Luzynski wrote:
>>
>>> Regarding the tests, I think there is no complete transliteration 
>>> test suite at the moment.  Probably the only test is 
>>> localedata/bug-iconv-trans.c. You can also see the collation tests 
>>> placed in the same directory, they use those multiple *.UTF-8.in 
>>> files.
>>>
>>> You can skip the tests for now.
>>
>> First I though they could just be added but not all locales
>> transliterate Umlauts so just extending the current test won't do as it
>> will fail for those locales.
> 
> I still think a one-time check against uconv(1) (part of Unicode's ICU
> project) for discrepancies.

Just an addition. I have changes a few constants to see whether
localedata/bug-iconv-trans.c could be made to test cyrillic. Attached is
the bug-iconv-trans-cyr.c that goes through in this form. I had to save
it as UTF-8 instead of ISO-8859-15 for localedata/bug-iconv-trans.c.

>>>> [...] diff -uNr a/localedata/locales/am_ET 
>>>> b/localedata/locales/am_ET --- a/localedata/locales/am_ET 
>>>> 2018-10-11 15:10:11.000000000 +0000 +++ b/localedata/locales/am_ET 
>>>> 2018-10-11 15:10:43.000000000 +0000 @@ -1394,6 +1394,7 @@ <U137A> 
>>>> <U0060><U0039><U0030> <U137B> <U0060><U0031><U0030><U0030> <U137C> 
>>>> <U0060><U0031><U0030><U0030><U0030><U0030> +include 
>>>> "translit_cyrillic";"" translit_end % END LC_CTYPE
>>>
>>> Shouldn't “include "translit_cyrillic";""” be placed before the 
>>> custom rules, together with other includes?  The same in more files, 
>>> I will not mention them all.
>>
>> If I recall correctly it is because of the
>> "translit_end
>> END LC_CTYPE"
>> part at the end of the translit_cyrillic. This way it works for any
>> locale, regardless whether it has translit itself or not. And being at
>> the end it does not supersede any previous transliteration that may be
>> there for a reason.
> 
> I suspect one problem would be that the latter rule wins, so if there
> are some locale-specific rules than possible translit_* inclusions would
> override them if not included before the locale-specific rules.

What is the best way forward here? Can somebody make an explicit
suggestion on how to change the current approach if needed?

Bests,
Egor


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: bug-iconv-trans-cyr.c --]
[-- Type: text/x-csrc; name="bug-iconv-trans-cyr.c", Size: 2207 bytes --]

#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  iconv_t cd;
  const char str[] = "CyrillicLetters_ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’";
  const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
  char *inptr = (char *) str;
  size_t inlen = strlen (str) + 1;
  char outbuf[500];
  char *outptr = outbuf;
  size_t outlen = sizeof (outbuf);
  int result = 0;
  size_t n;

  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
    {
      puts ("setlocale failed");
      return 1;
    }

  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      puts ("iconv_open failed");
      return 1;
    }

  n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  if (n != 174)
    {
      if (n == (size_t) -1)
	printf ("iconv() returned error: %m\n");
      else
	printf ("iconv() returned %Zd, expected 7\n", n);
      result = 1;
    }
  if (inlen != 0)
    {
      puts ("not all input consumed");
      result = 1;
    }
  else if (inptr - str != strlen (str) + 1)
    {
      printf ("inptr wrong, advanced by %td\n", inptr - str);
      result = 1;
    }
  if (memcmp (outbuf, expected, sizeof (expected)) != 0)
    {
      printf ("result wrong: \"%.*s\", expected: \"%s\"\n",
	      (int) (sizeof (outbuf) - outlen), outbuf, expected);
      result = 1;
    }
  else if (outlen != sizeof (outbuf) - sizeof (expected))
    {
      printf ("outlen wrong: %Zd, expected %Zd\n", outlen,
	      sizeof (outbuf) - 15);
      result = 1;
    }
  else
    printf ("output is \"%s\" which is OK\n", outbuf);

  return result;
}

  reply	other threads:[~2018-10-15 11:55 UTC|newest]

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
     [not found] ` <20180412224352.GB2911@altlinux.org>
2018-07-17 19:34   ` SUBJECT: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] Egor Kobylkin
2018-07-17 19:40     ` Carlos O'Donell
2018-07-17 19:50       ` Egor Kobylkin
2018-07-17 19:59         ` Carlos O'Donell
2018-08-06 19:00   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29 Egor Kobylkin
2018-10-03  8:26     ` Egor Kobylkin
2018-10-03  9:19       ` Keld Simonsen
2018-10-03  9:32         ` Egor Kobylkin
2018-10-05  8:43           ` Marko Myllynen
2018-10-05  9:20           ` Rafal Luzynski
2018-10-05 10:36             ` Egor Kobylkin
2018-10-08 22:04               ` Rafal Luzynski
2018-10-08 22:52                 ` Egor Kobylkin
2018-10-09 21:43                   ` Rafal Luzynski
2018-10-08 23:20                 ` Zack Weinberg
2018-10-09 15:26                   ` Carlos O'Donell
2018-10-09 21:51                     ` Rafal Luzynski
2018-10-09 16:10                 ` Marko Myllynen
2018-10-09 16:22                   ` Egor Kobylkin
2018-10-09 16:49                     ` Marko Myllynen
2018-10-09 22:08                   ` Rafal Luzynski
2018-10-10 11:21                     ` Marko Myllynen
2018-10-11 10:10                   ` Marko Myllynen
     [not found]             ` <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
2018-10-05 11:54               ` Marko Myllynen
2018-10-05 12:00                 ` Egor Kobylkin
2018-10-05 12:21                   ` Marko Myllynen
2018-10-05 20:47                     ` Egor Kobylkin
2018-10-08 12:40                       ` Marko Myllynen
2018-10-08 22:23                         ` Rafal Luzynski
2018-10-08 23:35                           ` Egor Kobylkin
2018-10-09 13:18                             ` Egor Kobylkin
2018-10-09 18:34                               ` Egor Kobylkin
2018-10-09 22:17                                 ` Rafal Luzynski
2018-10-09 22:40                                   ` Egor Kobylkin
2018-10-09 22:42                                     ` Egor Kobylkin
2018-10-10 11:22                                       ` Marko Myllynen
2018-10-10 12:19                                         ` Egor Kobylkin
2018-10-10 12:34                                           ` Marko Myllynen
2018-10-10 22:29   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2 Egor Kobylkin
2018-10-11  9:59     ` Marko Myllynen
2018-10-11 11:04     ` Rafal Luzynski
2018-10-11 13:10       ` Marko Myllynen
2018-10-11 13:50       ` Volodymyr Lisivka
2018-10-11 14:59       ` Egor Kobylkin
2018-10-11 21:30         ` Egor Kobylkin
2018-10-11 15:05       ` Egor Kobylkin
2018-10-11 15:44   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v3 Egor Kobylkin
2018-10-11 21:33   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v4 Egor Kobylkin
2018-10-12 14:05   ` [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] Egor Kobylkin
2018-10-13  0:59     ` Rafal Luzynski
2018-10-13 16:58       ` Egor Kobylkin
2018-10-15 11:04         ` Marko Myllynen
2018-10-15 11:54           ` Egor Kobylkin [this message]
2018-10-23 23:08         ` Rafal Luzynski
2018-10-17 14:16   ` [PATCH v6] " Egor Kobylkin
2018-11-01 22:51   ` [PATCH v7] " Egor Kobylkin
2018-11-02  0:00   ` [PATCH v8] " Egor Kobylkin
2018-11-02 22:22     ` Rafal Luzynski
2018-11-02 23:27       ` Egor Kobylkin
2018-11-14 21:25   ` [PATCH v9] " Egor Kobylkin
2018-11-16 22:17     ` Rafal Luzynski
2018-11-17 18:34       ` Egor Kobylkin
2018-11-19  7:13         ` Marko Myllynen
2018-11-19  9:21           ` Egor Kobylkin
2018-11-19 19:35             ` Marko Myllynen
2018-12-01 22:07           ` Rafal Luzynski
2018-12-01 22:53             ` Egor Kobylkin
2018-12-03 22:19             ` Egor Kobylkin
2018-12-08  1:15               ` Rafal Luzynski
2018-12-10 21:20                 ` Marko Myllynen
2018-12-19 22:25                   ` Rafal Luzynski
2018-12-19 22:48                     ` Egor Kobylkin
2018-12-19 23:50                       ` Rafal Luzynski
2018-11-19 11:10   ` [PATCH v10] " Egor Kobylkin
2018-12-07 23:35     ` Rafal Luzynski
2018-12-08 21:51       ` Egor Kobylkin
2018-12-19 22:41         ` Rafal Luzynski
2018-12-19 23:02           ` Egor Kobylkin
2018-12-20  0:05             ` Rafal Luzynski
2018-12-08 22:28   ` [PATCH v11] Locales: Cyrillic -> ASCII transliteration " Egor Kobylkin
2018-12-19 23:16     ` Egor Kobylkin
2018-12-26 10:07       ` Siddhesh Poyarekar
2018-12-26 12:13         ` Egor Kobylkin
2018-12-27  1:30           ` Siddhesh Poyarekar
2018-12-27 11:28             ` Rafal Luzynski
2019-01-02 18:38   ` [PATCH v12] " Egor Kobylkin
2019-01-05 14:35     ` Rafal Luzynski
2019-01-05 21:12       ` Egor Kobylkin
2019-01-07 20:37         ` Marko Myllynen
2019-01-09  0:46           ` Egor Kobylkin
2019-01-09 20:03             ` Marko Myllynen
2019-02-04  7:14               ` [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30 Egor Kobylkin
2019-02-14 16:48                 ` Marko Myllynen
2019-03-04 22:11                   ` Egor Kobylkin
2019-03-11 13:59                     ` PING " Egor Kobylkin
2019-03-14 19:48                       ` Egor Kobylkin
2019-04-19 22:24                   ` Rafal Luzynski
     [not found]                     ` <5ELixS9SQ0DW4mlvswp96ASpLobBabU9KQ6zOTH-Udrb34mABhcqiPERpBZfPWZ9F77s8XNmiLIAq9UWu0AjLFFdjOz_FZVU5_xF-SiQkrw=@kobylkin.com>
2019-04-27  2:51                       ` Siddhesh Poyarekar
2019-04-27  7:34                         ` Diego (Egor) Kobylkin
2019-04-09  1:04     ` [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] Carlos O'Donell
2019-03-19 10:39   ` ping " Egor Kobylkin
2019-03-28 16:20     ` [PING^4][PATCH " Marko Myllynen
2019-04-04 19:44     ` [PING^5][PATCH " Egor Kobylkin
2019-04-06  1:36       ` Siddhesh Poyarekar
2019-04-16  7:15     ` [PING^6][PATCH " Marko Myllynen
2019-04-16 13:17       ` Carlos O'Donell
2019-04-16 17:06         ` Egor Kobylkin
2019-04-16 17:58           ` Carlos O'Donell
2019-04-16 18:41             ` Egor Kobylkin
2019-04-16 19:06               ` Carlos O'Donell
2019-05-10 12:19                 ` Marko Myllynen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8206061b-4e9c-b366-85a4-93ef61687ca0@kobylkin.com \
    --to=egor@kobylkin.com \
    --cc=danilo@gnome.org \
    --cc=digitalfreak@lingonborough.com \
    --cc=ldv@altlinux.org \
    --cc=libc-alpha@sourceware.org \
    --cc=libc-locales@sourceware.org \
    --cc=mfabian@redhat.com \
    --cc=mkutny@gmail.com \
    --cc=myllynen@redhat.com \
    --cc=vlisivka@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).