From mboxrd@z Thu Jan  1 00:00:00 1970
To: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [PATCH v4 1/3] Add support for locales with zero collation rules.
References: <20210729063515.1541388-1-carlos@redhat.com>
 <20210729063515.1541388-2-carlos@redhat.com>
Date: Thu, 29 Jul 2021 11:44:20 +0200
In-Reply-To: <20210729063515.1541388-2-carlos@redhat.com> (Carlos O'Donell via
 Libc-alpha's message of "Thu, 29 Jul 2021 02:35:13 -0400")
Message-ID: <8735rx4fpn.fsf@oldenburg.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Florian Weimer <fweimer@redhat.com>
Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces+e=80x24.org@sourceware.org>

* Carlos O'Donell via Libc-alpha:

> @@ -600,42 +597,51 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>                      if (c == L_('-') && *p != L_(']'))
>                        {
>  #if _LIBC
> -                        /* We have to find the collation sequence
> -                           value for C.  Collation sequence is nothing
> -                           we can regularly access.  The sequence
> -                           value is defined by the order in which the
> -                           definitions of the collation values for the
> -                           various characters appear in the source
> -                           file.  A strange concept, nowhere
> -                           documented.  */
> -                        uint32_t fcollseq;
> -                        uint32_t lcollseq;
> +			/* We must find the collation sequence values for
> +			   the low part of the range, the high part of the
> +			   range and the searched value FN.  We do this by
> +			   using the POSIX concept of Collation Element
> +			   Ordering, which is the defined order of elements
> +			   in the source locale.  FCOLLSEQ is the searched
> +			   element in the range, while LCOLLSEQ is the low
> +			   element in the range.  If we have no collation
> +			   rules (nrules == 0) then we must fall back to a
> +			   basic code point value for the collation
> +			   sequence value (which is correct for ASCII and
> +			   UTF-8).  We must never use collseq if nrules ==
> +			   0 since none of the tables we need will be
> +			   present in the compiled binary locale.  We start
> +			   with fcollseq and lcollseq at unknown collation
> +			   sequences.  We only compute hcollseq, the high
> +			   part of the range if required.  */
> +                        uint32_t fcollseq = ~((uint32_t) 0);
> +                        uint32_t lcollseq = ~((uint32_t) 0);
>                          UCHAR cend = *p++;

Looks like ~((uint32_t) 0) should be defined as a named macro/constant
alongside __collseq_table_lookup.
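
A minimal sketch of what such a named constant could look like.  The
names COLLSEQ_UNKNOWN and toy_collseq_lookup are hypothetical and used
only for illustration; they are not existing glibc identifiers, and the
real __collseq_table_lookup reads the compiled locale table rather than
a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the "no collation sequence value" sentinel;
   the patch spells it ~((uint32_t) 0) at each use.  */
#define COLLSEQ_UNKNOWN (~((uint32_t) 0))

/* Toy stand-in for a table lookup: returns the stored value, or
   COLLSEQ_UNKNOWN when the character is outside the table.  */
static uint32_t
toy_collseq_lookup (const uint32_t *table, size_t size, uint32_t wc)
{
  assert (table != NULL);
  if (wc >= size)
    return COLLSEQ_UNKNOWN;
  return table[wc];
}
```

Callers would then compare against COLLSEQ_UNKNOWN instead of repeating
the expression or the literal 0xffffffff.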


>  
> +			if (nrules != 0)
> +			  {
>  # if WIDE_CHAR_VERSION
> +			    /* Search the collation data for the character.  */
> +			    fcollseq = __collseq_table_lookup (collseq, fn);
> +			    if (fcollseq == ~((uint32_t) 0))
> +			      /* We don't know anything about the character
> +				 we are supposed to match.  This means we are
> +				 failing.  */
> +			      goto range_not_matched;
> +
> +			    if (is_seqval)
> +			      lcollseq = cold;
> +			    else
> +			      lcollseq = __collseq_table_lookup (collseq, cold);
>  # else
> +			    fcollseq = collseq[fn];
> +			    lcollseq = is_seqval ? cold : collseq[(UCHAR) cold];
>  # endif
> +			  }
>  
>                          is_seqval = false;
>                          if (cend == L_('[') && *p == L_('.'))
>                            {
>                              const CHAR *startp = p;
>                              size_t c1 = 0;
>  
> @@ -752,14 +758,20 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>                              cend = FOLD (cend);
>                            }
>  
> +			/* If we have rules, and the low sequence is lower than
> +			   the value of the searched sequence then we must
> +			   lookup the high collation sequence value and
> +			   determine if the fcollseq falls within the range.
> +			   If hcollseq is unknown then we could still match
> +			   fcollseq on the low end of the range.  If lcollseq
> +			   is unknown (0xffffffff) we will still fail to
> +			   match, but in the future we might consider matching
> +			   the high end of the range on an exact match.  */
> +                        if (nrules != 0 && (
>  # if WIDE_CHAR_VERSION
>                              lcollseq == 0xffffffff ||

This should use the same constant as the initialization.  The #if is now
unnecessary and can be removed.
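
Roughly, the condition could then read as in the sketch below.  This is
an illustration only: COLLSEQ_UNKNOWN and must_check_upper_bound are
hypothetical names, and the sketch assumes lcollseq starts out at the
unknown value in both the wide and the narrow build, as in the quoted
initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COLLSEQ_UNKNOWN (~((uint32_t) 0))  /* hypothetical name */

/* The named constant has the same value as the literal in the hunk.  */
static_assert (COLLSEQ_UNKNOWN == 0xffffffffu,
	       "sentinel matches the literal used in the hunk");

/* Decide whether the upper bound of the range has to be examined.
   Mirrors the condition in the quoted hunk with the #if dropped.  */
static bool
must_check_upper_bound (uint32_t nrules, uint32_t lcollseq,
			uint32_t fcollseq)
{
  return nrules != 0
	 && (lcollseq == COLLSEQ_UNKNOWN || lcollseq <= fcollseq);
}
```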

>  # endif
> +                            lcollseq <= fcollseq))
>                            {
>                              /* We have to look at the upper bound.  */
>                              uint32_t hcollseq;
> @@ -789,6 +801,17 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>                              if (lcollseq <= hcollseq && fcollseq <= hcollseq)
>                                goto matched;
>                            }
> +
> +			/* No rules, but we have a range.  */
> +			if (nrules == 0)
> +			  {
> +			    if (cend == L_('\0'))
> +			      return FNM_NOMATCH;
> +
> +			    /* Compare that fn is within the range.  */
> +			    if ((UCHAR) cold <= fn && fn <= cend)
> +			      goto matched;
> +

This part looks okay to me.
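
For reference, the code point comparison can be sketched in isolation
as below.  in_range_by_code_point and range_examples are hypothetical
names, and the sketch assumes cold is a plain char (as the cast in the
hunk suggests) while fn and cend are already unsigned.  Without the
cast, a negative cold on a signed-char target would compare below every
fn and the range would match too much.

```c
#include <assert.h>
#include <stdbool.h>

/* Range check for a locale with no collation rules: compare raw
   character codes.  The cast keeps bytes >= 0x80 above ASCII.  */
static bool
in_range_by_code_point (char cold, unsigned char cend, unsigned char fn)
{
  return (unsigned char) cold <= fn && fn <= cend;
}

/* Usage examples.  */
static void
range_examples (void)
{
  assert (in_range_by_code_point ('a', 'z', 'm'));
  assert (!in_range_by_code_point ('a', 'z', 'A'));
  /* High bytes: 'a' must not fall inside [\xe0-\xff].  */
  assert (!in_range_by_code_point ('\xe0', '\xff', 'a'));
}
```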

> diff --git a/posix/regcomp.c b/posix/regcomp.c
> index d93698ae78..f55d20cbfd 100644
> --- a/posix/regcomp.c
> +++ b/posix/regcomp.c
> @@ -2889,7 +2889,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>  	  if (MB_CUR_MAX == 1)
>  	  */
>  	  if (nrules == 0)
> -	    return collseqmb[br_elem->opr.ch];
> +	    return br_elem->opr.ch;
>  	  else
>  	    {
>  	      wint_t wc = __btowc (br_elem->opr.ch);
> @@ -2900,6 +2900,8 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>  	{
>  	  if (nrules != 0)
>  	    return __collseq_table_lookup (collseqwc, br_elem->opr.wch);
> +	  else
> +	    return br_elem->opr.wch;
>  	}
>        else if (br_elem->type == COLL_SYM)
>  	{
> @@ -2935,7 +2937,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>  		}
>  	    }
>  	  else if (sym_name_len == 1)
> -	    return collseqmb[br_elem->opr.name[0]];
> +	    return br_elem->opr.name[0];
>  	}
>        return UINT_MAX;
>      }
> @@ -3017,7 +3019,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>  	  if (MB_CUR_MAX == 1)
>  	  */
>  	  if (nrules == 0)
> -	    ch_collseq = collseqmb[ch];
> +	    ch_collseq = ch;
>  	  else
>  	    ch_collseq = __collseq_table_lookup (collseqwc, __btowc (ch));
>  	  if (start_collseq <= ch_collseq && ch_collseq <= end_collseq)
> @@ -3103,11 +3105,11 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>    int token_len;
>    bool first_round = true;
>  #ifdef _LIBC
> -  collseqmb = (const unsigned char *)
> -    _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
>    nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
>    if (nrules)
>      {
> +      collseqmb = (const unsigned char *)
> +	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
>        /*
>        if (MB_CUR_MAX > 1)
>        */

These changes look good.

> diff --git a/posix/regexec.c b/posix/regexec.c
> index f7b4f9cfc3..6cc23831aa 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3858,62 +3858,53 @@ check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
>  }
>  
>  # ifdef _LIBC
> +#include <assert.h>

This really should go at the start of the file.

> +
>  static unsigned int
>  find_collation_sequence_value (const unsigned char *mbs, size_t mbs_len)
>  {
> +  int32_t idx;
> +  const unsigned char *extra = (const unsigned char *)
>  	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB);
> +  int32_t extrasize = (const unsigned char *)
>  	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB + 1) - extra;
> +  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
> +
> +  /* Only called from within 'if (nrules != 0)'.  */ 

Trailing whitespace on the last line.  Most of this is just
reindentation, so okay.

> +  assert (nrules != 0);

Right, the single caller is under nrules != 0.
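
The pattern, reduced to a toy example: the helper asserts its
precondition and the only caller keeps the call under the nrules check.
toy_find_seq_value and toy_caller are hypothetical names, the flat
table stands in for the real locale data, and the code point fallback
in the caller is an assumption of this sketch, not a description of
regexec.c.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a helper that reads tables which exist only when the
   locale has collation rules.  */
static unsigned int
toy_find_seq_value (uint32_t nrules, const unsigned char *table,
		    unsigned char ch)
{
  /* Precondition: only called from within 'if (nrules != 0)'.  */
  assert (nrules != 0);
  return table[ch];
}

/* The single caller guards the call, so the assertion cannot fire.  */
static unsigned int
toy_caller (uint32_t nrules, const unsigned char *table, unsigned char ch)
{
  if (nrules != 0)
    return toy_find_seq_value (nrules, table, ch);
  return ch;  /* Sketch only: no rules, use the code point.  */
}
```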

strcoll and strxfrm are already correct.

Thanks,
Florian