From mboxrd@z Thu Jan  1 00:00:00 1970
To: libc-alpha@sourceware.org
Subject: [PATCH v4 1/3] Add support for locales with zero collation rules.
Date: Thu, 29 Jul 2021 02:35:13 -0400
Message-Id: <20210729063515.1541388-2-carlos@redhat.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20210729063515.1541388-1-carlos@redhat.com>
References: <20210729063515.1541388-1-carlos@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"
From: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces+e=80x24.org@sourceware.org>

While there is code to handle 'nrules == 0' in various locations
within posix/fnmatch_loop.c, posix/regcomp.c and posix/regexec.c,
these conditionals do not work.  The only collation with zero
rules in effect today is the builtin C/POSIX locale, which is
built by hand, and despite having zero rules it still has collseqmb
and collseqwc tables stored in the locale data.  These tables are
simple identity tables which are not actually required and could
be removed at a later date after this change.  These changes
prepare for C.UTF-8, which has zero rules and no collation
sequence tables (multibyte or widechar).

No regressions on x86_64 or i686.
---
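For reviewers, here is a minimal standalone sketch of the fallback this
patch relies on when nrules == 0: the collation sequence value of a
character is taken to be its code point, so a bracket range reduces to
two integer comparisons.  The helper names seq_value_no_rules and
range_matches_no_rules are hypothetical and exist only for this
illustration; they are not glibc functions and this is not the code in
the diff below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of glibc: with zero collation rules
   there are no collseq tables to consult, so the sequence value is
   assumed to be the code point itself.  */
static uint32_t
seq_value_no_rules (uint32_t ch)
{
  return ch;
}

/* Hypothetical helper: return true if FN falls inside the bracket
   range [LOW-HIGH] when no collation rules are available.  A reversed
   range (LOW > HIGH) never matches.  */
static bool
range_matches_no_rules (uint32_t low, uint32_t high, uint32_t fn)
{
  uint32_t lseq = seq_value_no_rules (low);
  uint32_t hseq = seq_value_no_rules (high);
  uint32_t fseq = seq_value_no_rules (fn);

  return lseq <= fseq && fseq <= hseq;
}
```

With this sketch, range_matches_no_rules ('a', 'c', 'b') is true and
range_matches_no_rules ('a', 'c', 'd') is false, which mirrors the
'(UCHAR) cold <= fn && fn <= cend' comparison added to fnmatch_loop.c.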
 posix/fnmatch_loop.c | 95 +++++++++++++++++++++++++++-----------------
 posix/regcomp.c      | 12 +++---
 posix/regexec.c      | 85 ++++++++++++++++++---------------------
 3 files changed, 104 insertions(+), 88 deletions(-)

diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
index 7f938af590..547952f0a9 100644
--- a/posix/fnmatch_loop.c
+++ b/posix/fnmatch_loop.c
@@ -51,6 +51,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
     _NL_CURRENT(LC_COLLATE, _NL_COLLATE_COLLSEQMB);
 # endif
 #endif
+  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
 
   while ((c = *p++) != L_('\0'))
     {
@@ -324,8 +325,6 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                        diagnose a "used initialized" in a dead branch in the
                        findidx function.  */
                     UCHAR str;
-                    uint32_t nrules =
-                      _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
                     const CHAR *startp = p;
 
                     c = *++p;
@@ -437,8 +436,6 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 
                     if (c == L_('[') && *p == L_('.'))
                       {
-                        uint32_t nrules =
-                          _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
                         const CHAR *startp = p;
                         size_t c1 = 0;
 
@@ -600,42 +597,51 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                     if (c == L_('-') && *p != L_(']'))
                       {
 #if _LIBC
-                        /* We have to find the collation sequence
-                           value for C.  Collation sequence is nothing
-                           we can regularly access.  The sequence
-                           value is defined by the order in which the
-                           definitions of the collation values for the
-                           various characters appear in the source
-                           file.  A strange concept, nowhere
-                           documented.  */
-                        uint32_t fcollseq;
-                        uint32_t lcollseq;
+			/* We must find the collation sequence values for
+			   the low part of the range, the high part of the
+			   range and the searched value FN.  We do this by
+			   using the POSIX concept of Collation Element
+			   Ordering, which is the defined order of elements
+			   in the source locale.  FCOLLSEQ is the searched
+			   element in the range, while LCOLLSEQ is the low
+			   element in the range.  If we have no collation
+			   rules (nrules == 0) then we must fall back to a
+			   basic code point value for the collation
+			   sequence value (which is correct for ASCII and
+			   UTF-8).  We must never use collseq if nrules ==
+			   0 since none of the tables we need will be
+			   present in the compiled binary locale.  We start
+			   with fcollseq and lcollseq at unknown collation
+			   sequences.  We only compute hcollseq, the high
+			   part of the range if required.  */
+                        uint32_t fcollseq = ~((uint32_t) 0);
+                        uint32_t lcollseq = ~((uint32_t) 0);
                         UCHAR cend = *p++;
 
+			if (nrules != 0)
+			  {
 # if WIDE_CHAR_VERSION
-                        /* Search in the 'names' array for the characters.  */
-                        fcollseq = __collseq_table_lookup (collseq, fn);
-                        if (fcollseq == ~((uint32_t) 0))
-                          /* XXX We don't know anything about the character
-                             we are supposed to match.  This means we are
-                             failing.  */
-                          goto range_not_matched;
-
-                        if (is_seqval)
-                          lcollseq = cold;
-                        else
-                          lcollseq = __collseq_table_lookup (collseq, cold);
+			    /* Search the collation data for the character.  */
+			    fcollseq = __collseq_table_lookup (collseq, fn);
+			    if (fcollseq == ~((uint32_t) 0))
+			      /* We don't know anything about the character
+				 we are supposed to match.  This means we are
+				 failing.  */
+			      goto range_not_matched;
+
+			    if (is_seqval)
+			      lcollseq = cold;
+			    else
+			      lcollseq = __collseq_table_lookup (collseq, cold);
 # else
-                        fcollseq = collseq[fn];
-                        lcollseq = is_seqval ? cold : collseq[(UCHAR) cold];
+			    fcollseq = collseq[fn];
+			    lcollseq = is_seqval ? cold : collseq[(UCHAR) cold];
 # endif
+			  }
 
                         is_seqval = false;
                         if (cend == L_('[') && *p == L_('.'))
                           {
-                            uint32_t nrules =
-                              _NL_CURRENT_WORD (LC_COLLATE,
-                                                _NL_COLLATE_NRULES);
                             const CHAR *startp = p;
                             size_t c1 = 0;
 
@@ -752,14 +758,20 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                             cend = FOLD (cend);
                           }
 
-                        /* XXX It is not entirely clear to me how to handle
-                           characters which are not mentioned in the
-                           collation specification.  */
-                        if (
+			/* If we have rules, and the low sequence is lower than
+			   the value of the searched sequence then we must
+			   look up the high collation sequence value and
+			   determine if the fcollseq falls within the range.
+			   If hcollseq is unknown then we could still match
+			   fcollseq on the low end of the range.  If lcollseq
+			   is unknown (0xffffffff) we will still fail to
+			   match, but in the future we might consider matching
+			   the high end of the range on an exact match.  */
+                        if (nrules != 0 && (
 # if WIDE_CHAR_VERSION
                             lcollseq == 0xffffffff ||
 # endif
-                            lcollseq <= fcollseq)
+                            lcollseq <= fcollseq))
                           {
                             /* We have to look at the upper bound.  */
                             uint32_t hcollseq;
@@ -789,6 +801,17 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                             if (lcollseq <= hcollseq && fcollseq <= hcollseq)
                               goto matched;
                           }
+
+			/* No rules, but we have a range.  */
+			if (nrules == 0)
+			  {
+			    if (cend == L_('\0'))
+			      return FNM_NOMATCH;
+
+			    /* Compare that fn is within the range.  */
+			    if ((UCHAR) cold <= fn && fn <= cend)
+			      goto matched;
+			  }
 # if WIDE_CHAR_VERSION
                       range_not_matched:
 # endif
diff --git a/posix/regcomp.c b/posix/regcomp.c
index d93698ae78..f55d20cbfd 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -2889,7 +2889,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 	  if (MB_CUR_MAX == 1)
 	  */
 	  if (nrules == 0)
-	    return collseqmb[br_elem->opr.ch];
+	    return br_elem->opr.ch;
 	  else
 	    {
 	      wint_t wc = __btowc (br_elem->opr.ch);
@@ -2900,6 +2900,8 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 	{
 	  if (nrules != 0)
 	    return __collseq_table_lookup (collseqwc, br_elem->opr.wch);
+	  else
+	    return br_elem->opr.wch;
 	}
       else if (br_elem->type == COLL_SYM)
 	{
@@ -2935,7 +2937,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 		}
 	    }
 	  else if (sym_name_len == 1)
-	    return collseqmb[br_elem->opr.name[0]];
+	    return br_elem->opr.name[0];
 	}
       return UINT_MAX;
     }
@@ -3017,7 +3019,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 	  if (MB_CUR_MAX == 1)
 	  */
 	  if (nrules == 0)
-	    ch_collseq = collseqmb[ch];
+	    ch_collseq = ch;
 	  else
 	    ch_collseq = __collseq_table_lookup (collseqwc, __btowc (ch));
 	  if (start_collseq <= ch_collseq && ch_collseq <= end_collseq)
@@ -3103,11 +3105,11 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
   int token_len;
   bool first_round = true;
 #ifdef _LIBC
-  collseqmb = (const unsigned char *)
-    _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
   nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
   if (nrules)
     {
+      collseqmb = (const unsigned char *)
+	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
       /*
       if (MB_CUR_MAX > 1)
       */
diff --git a/posix/regexec.c b/posix/regexec.c
index f7b4f9cfc3..6cc23831aa 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -3858,62 +3858,53 @@ check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
 }
 
 # ifdef _LIBC
+#include <assert.h>
+
 static unsigned int
 find_collation_sequence_value (const unsigned char *mbs, size_t mbs_len)
 {
-  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
-  if (nrules == 0)
-    {
-      if (mbs_len == 1)
-	{
-	  /* No valid character.  Match it as a single byte character.  */
-	  const unsigned char *collseq = (const unsigned char *)
-	    _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
-	  return collseq[mbs[0]];
-	}
-      return UINT_MAX;
-    }
-  else
-    {
-      int32_t idx;
-      const unsigned char *extra = (const unsigned char *)
+  int32_t idx;
+  const unsigned char *extra = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB);
-      int32_t extrasize = (const unsigned char *)
+  int32_t extrasize = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB + 1) - extra;
+  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
+
+  /* Only called from within 'if (nrules != 0)'.  */
+  assert (nrules != 0);
 
-      for (idx = 0; idx < extrasize;)
+  for (idx = 0; idx < extrasize;)
+    {
+      int mbs_cnt;
+      bool found = false;
+      int32_t elem_mbs_len;
+      /* Skip the name of the collating element.  */
+      idx = idx + extra[idx] + 1;
+      elem_mbs_len = extra[idx++];
+      if (mbs_len == elem_mbs_len)
 	{
-	  int mbs_cnt;
-	  bool found = false;
-	  int32_t elem_mbs_len;
-	  /* Skip the name of collating element name.  */
-	  idx = idx + extra[idx] + 1;
-	  elem_mbs_len = extra[idx++];
-	  if (mbs_len == elem_mbs_len)
-	    {
-	      for (mbs_cnt = 0; mbs_cnt < elem_mbs_len; ++mbs_cnt)
-		if (extra[idx + mbs_cnt] != mbs[mbs_cnt])
-		  break;
-	      if (mbs_cnt == elem_mbs_len)
-		/* Found the entry.  */
-		found = true;
-	    }
-	  /* Skip the byte sequence of the collating element.  */
-	  idx += elem_mbs_len;
-	  /* Adjust for the alignment.  */
-	  idx = (idx + 3) & ~3;
-	  /* Skip the collation sequence value.  */
-	  idx += sizeof (uint32_t);
-	  /* Skip the wide char sequence of the collating element.  */
-	  idx = idx + sizeof (uint32_t) * (*(int32_t *) (extra + idx) + 1);
-	  /* If we found the entry, return the sequence value.  */
-	  if (found)
-	    return *(uint32_t *) (extra + idx);
-	  /* Skip the collation sequence value.  */
-	  idx += sizeof (uint32_t);
+	  for (mbs_cnt = 0; mbs_cnt < elem_mbs_len; ++mbs_cnt)
+	    if (extra[idx + mbs_cnt] != mbs[mbs_cnt])
+	      break;
+	  if (mbs_cnt == elem_mbs_len)
+	    /* Found the entry.  */
+	    found = true;
 	}
-      return UINT_MAX;
+      /* Skip the byte sequence of the collating element.  */
+      idx += elem_mbs_len;
+      /* Adjust for the alignment.  */
+      idx = (idx + 3) & ~3;
+      /* Skip the collation sequence value.  */
+      idx += sizeof (uint32_t);
+      /* Skip the wide char sequence of the collating element.  */
+      idx = idx + sizeof (uint32_t) * (*(int32_t *) (extra + idx) + 1);
+      /* If we found the entry, return the sequence value.  */
+      if (found)
+	return *(uint32_t *) (extra + idx);
+      /* Skip the collation sequence value.  */
+      idx += sizeof (uint32_t);
     }
+  return UINT_MAX;
 }
 # endif /* _LIBC */
 #endif /* RE_ENABLE_I18N */
-- 
2.31.1