From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-96420-e=80x24.org@sourceware.org>
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: AS31976 209.132.180.0/23
X-Spam-Status: No, score=-2.0 required=3.0 tests=AWL,BAD_ENC_HEADER,BAYES_00,
	BODY_8BITS,DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,PP_MIME_FAKE_ASCII_TEXT,RCVD_IN_DNSWL_MED,SPF_HELO_PASS,
	SPF_PASS shortcircuit=no autolearn=no autolearn_force=no version=3.4.1
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id 5A1CC1F453
	for <e@80x24.org>; Mon, 15 Oct 2018 11:55:14 +0000 (UTC)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:subject:to:cc:references:from:message-id:date
	:mime-version:in-reply-to:content-type; q=dns; s=default; b=TPAQ
	XMQMvNZeXhUZrQZe4T5ZoVRvhiCURz4/EmSnhRmTlD1ak9zjGBdwLiTkTNYGz7+f
	Nwbzo4GCwwQ3IEV3NO3p0I59EM+R7zfYUa2ZIJ4cfr4Qu4GB/bVntmIqmpBDwxf7
	zB5ucXh612K2HSdqV+vXRnZYjtAEH3Suvy4nx7A=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:subject:to:cc:references:from:message-id:date
	:mime-version:in-reply-to:content-type; s=default; bh=5APoPSSznv
	6+IXhviHZFVkUSWlE=; b=CB/X0dbB4wi5uCIAfuggflBcUY1F+szOXGzW5pmTC9
	OBY2Jlb0gDglRuBRn/r5t0Grcdme7JfrhozqwSMsiG27jJr14h8IaxM3t0DOe5CV
	aslGu4eljK0VV9u1mm34jv6UaFy6lF5srZrEEJxYIQGJODgzYkjiToZ/tWLa8OI0
	E=
Received: (qmail 46248 invoked by alias); 15 Oct 2018 11:55:11 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-e=80x24.org@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 44420 invoked by uid 89); 15 Oct 2018 11:55:10 -0000
Authentication-Results: sourceware.org; auth=none
X-HELO: mout.kundenserver.de
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
To: Marko Myllynen <myllynen@redhat.com>,
 Rafal Luzynski <digitalfreak@lingonborough.com>, libc-alpha@sourceware.org,
 libc-locales@sourceware.org
Cc: mfabian@redhat.com, "Dmitry V. Levin" <ldv@altlinux.org>,
 Volodymyr Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com>,
 danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <d5582688-819b-90c2-3f4a-0d19c932d487@kobylkin.com>
 <165238610.582597.1539392357757@poczta.nazwa.pl>
 <e072a70c-9962-4087-93c2-06ec3c9a0b1f@kobylkin.com>
 <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>
From: Egor Kobylkin <egor@kobylkin.com>
Openpgp: preference=signencrypt
Message-ID: <8206061b-4e9c-b366-85a4-93ef61687ca0@kobylkin.com>
Date: Mon, 15 Oct 2018 13:54:53 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>
Content-Type: multipart/mixed;
 boundary="------------854CBCEEDB262A895610EB51"

This is a multi-part message in MIME format.
--------------854CBCEEDB262A895610EB51
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

On 15.10.2018 13:04, Marko Myllynen wrote:
> Hi,
> 
> On 2018-10-13 19:58, Egor Kobylkin wrote:
>> On 13.10.2018 02:59, Rafal Luzynski wrote:
>>
>>> Regarding the tests, I think there is no complete transliteration 
>>> test suite at the moment.  Probably the only test is 
>>> localedata/bug-iconv-trans.c. You can also see the collation tests 
>>> placed in the same directory, they use those multiple *.UTF-8.in 
>>> files.
>>>
>>> You can skip the tests for now.
>>
>> At first I thought they could simply be added, but not all locales
>> transliterate umlauts, so just extending the current test won't do:
>> it would fail for those locales.
> 
> I still think a one-time check against uconv(1) (part of Unicode's ICU
> project) for discrepancies would be worthwhile.

Just an addition: I have changed a few constants to see whether
localedata/bug-iconv-trans.c could be made to test Cyrillic. Attached is
bug-iconv-trans-cyr.c, which passes in this form. I had to save it as
UTF-8, whereas localedata/bug-iconv-trans.c is saved as ISO-8859-15.
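
Since not all locales transliterate the same way, one option would be to
pass the locale and its expected output in as parameters instead of
hard-coding them. What follows is only a rough sketch under that
assumption: check_translit is a hypothetical helper, not part of glibc
or of the attached file, and the 1024-byte buffer is an arbitrary choice.

```c
#include <assert.h>
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration only):
   transliterate the UTF-8 string INPUT to ASCII under LOCALE and
   compare the result with EXPECTED.
   Returns 0 on a match, 1 on a mismatch or on any error.  */
int
check_translit (const char *locale, const char *input, const char *expected)
{
  char outbuf[1024];
  char *inptr = (char *) input;
  char *outptr = outbuf;
  size_t inlen = strlen (input) + 1;	/* Convert the trailing NUL too.  */
  size_t outlen = sizeof (outbuf);
  int result = 0;
  iconv_t cd;

  assert (locale != NULL && input != NULL && expected != NULL);

  if (setlocale (LC_ALL, locale) == NULL)
    {
      printf ("%s: setlocale failed\n", locale);
      return 1;
    }

  cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      printf ("%s: iconv_open failed\n", locale);
      return 1;
    }

  if (iconv (cd, &inptr, &inlen, &outptr, &outlen) == (size_t) -1)
    {
      printf ("%s: iconv failed\n", locale);
      result = 1;
    }
  /* On success the NUL was converted as well, so outbuf is a string.  */
  else if (strcmp (outbuf, expected) != 0)
    {
      printf ("%s: got \"%s\", expected \"%s\"\n", locale, outbuf, expected);
      result = 1;
    }

  iconv_close (cd);
  return result;
}
```

A driver could then call check_translit once per locale, each with its
own expected string, so locales that transliterate differently would
not break the shared test.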

>>>> [...]
>>>> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
>>>> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
>>>> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
>>>> @@ -1394,6 +1394,7 @@
>>>>  <U137A> <U0060><U0039><U0030>
>>>>  <U137B> <U0060><U0031><U0030><U0030>
>>>>  <U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
>>>> +include "translit_cyrillic";""
>>>>  translit_end
>>>>  % END LC_CTYPE
>>>
>>> Shouldn't “include "translit_cyrillic";""” be placed before the 
>>> custom rules, together with other includes?  The same in more files, 
>>> I will not mention them all.
>>
>> If I recall correctly it is because of the
>> "translit_end
>> END LC_CTYPE"
part at the end of translit_cyrillic. This way it works for any
locale, regardless of whether it has translit rules itself or not. And
being at the end, it does not supersede any earlier transliteration that
may be there for a reason.
> 
> I suspect one problem would be that the latter rule wins, so if there
> are locale-specific rules, any translit_* inclusions would override
> them unless they are included before the locale-specific rules.

What is the best way forward here? Can somebody make an explicit
suggestion on how to change the current approach if needed?

Best,
Egor


--------------854CBCEEDB262A895610EB51
Content-Type: text/x-csrc;
 name="bug-iconv-trans-cyr.c"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment;
 filename="bug-iconv-trans-cyr.c"

#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  iconv_t cd;
  const char str[] = "CyrillicLetters_ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’";
  const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
  char *inptr = (char *) str;
  size_t inlen = strlen (str) + 1;
  char outbuf[500];
  char *outptr = outbuf;
  size_t outlen = sizeof (outbuf);
  int result = 0;
  size_t n;

  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
    {
      puts ("setlocale failed");
      return 1;
    }

  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      puts ("iconv_open failed");
      return 1;
    }

  n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
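  /* iconv returns the number of non-reversible conversions, which
     includes every transliterated character.  */
  if (n != 174)
    {
      if (n == (size_t) -1)
	printf ("iconv() returned error: %m\n");
      else
	printf ("iconv() returned %zu, expected 174\n", n);
      result = 1;
    }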
  if (inlen != 0)
    {
      puts ("not all input consumed");
      result = 1;
    }
  else if ((size_t) (inptr - str) != strlen (str) + 1)
    {
      printf ("inptr wrong, advanced by %td\n", inptr - str);
      result = 1;
    }
  if (memcmp (outbuf, expected, sizeof (expected)) != 0)
    {
      printf ("result wrong: \"%.*s\", expected: \"%s\"\n",
	      (int) (sizeof (outbuf) - outlen), outbuf, expected);
      result = 1;
    }
  else if (outlen != sizeof (outbuf) - sizeof (expected))
    {
      printf ("outlen wrong: %zu, expected %zu\n", outlen,
	      sizeof (outbuf) - sizeof (expected));
      result = 1;
    }
  else
    printf ("output is \"%s\" which is OK\n", outbuf);

  iconv_close (cd);
  return result;
}

--------------854CBCEEDB262A895610EB51--