From: Florian Weimer via Libc-alpha
Reply-To: Florian Weimer
To: Carlos O'Donell via Libc-alpha
Subject: Re: [PATCH v4 1/3] Add support for locales with zero collation rules.
Date: Thu, 29 Jul 2021 11:44:20 +0200
Message-ID: <8735rx4fpn.fsf@oldenburg.str.redhat.com>
In-Reply-To: <20210729063515.1541388-2-carlos@redhat.com> (Carlos O'Donell via Libc-alpha's message of "Thu, 29 Jul 2021 02:35:13 -0400")
References: <20210729063515.1541388-1-carlos@redhat.com> <20210729063515.1541388-2-carlos@redhat.com>
List-Id: Libc-alpha mailing list

* Carlos O'Donell via Libc-alpha:

> @@ -600,42 +597,51 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>                      if (c == L_('-') && *p != L_(']'))
>                        {
>  #if _LIBC
> -                        /* We have to find the collation sequence
> -                           value for C.  Collation sequence is nothing
> -                           we can regularly access.  The sequence
> -                           value is defined by the order in which the
> -                           definitions of the collation values for the
> -                           various characters appear in the source
> -                           file.  A strange concept, nowhere
> -                           documented.  */
> -                        uint32_t fcollseq;
> -                        uint32_t lcollseq;
> +                        /* We must find the collation sequence values for
> +                           the low part of the range, the high part of the
> +                           range and the searched value FN.  We do this by
> +                           using the POSIX concept of Collation Element
> +                           Ordering, which is the defined order of elements
> +                           in the source locale.  FCOLLSEQ is the searched
> +                           element in the range, while LCOLLSEQ is the low
> +                           element in the range.  If we have no collation
> +                           rules (nrules == 0) then we must fall back to a
> +                           basic code point value for the collation
> +                           sequence value (which is correct for ASCII and
> +                           UTF-8).  We must never use collseq if nrules ==
> +                           0 since none of the tables we need will be
> +                           present in the compiled binary locale.  We start
> +                           with fcollseq and lcollseq at unknown collation
> +                           sequences.  We only compute hcollseq, the high
> +                           part of the range if required.  */
> +                        uint32_t fcollseq = ~((uint32_t) 0);
> +                        uint32_t lcollseq = ~((uint32_t) 0);
>                          UCHAR cend = *p++;

Looks like ~((uint32_t) 0) needs to be added as a macro/constant to
__collseq_table_lookup.

>
> +                        if (nrules != 0)
> +                          {
>  # if WIDE_CHAR_VERSION
> +                            /* Search the collation data for the character.  */
> +                            fcollseq = __collseq_table_lookup (collseq, fn);
> +                            if (fcollseq == ~((uint32_t) 0))
> +                              /* We don't know anything about the character
> +                                 we are supposed to match.  This means we are
> +                                 failing.  */
> +                              goto range_not_matched;
> +
> +                            if (is_seqval)
> +                              lcollseq = cold;
> +                            else
> +                              lcollseq = __collseq_table_lookup (collseq, cold);
>  # else
> +                            fcollseq = collseq[fn];
> +                            lcollseq = is_seqval ? cold : collseq[(UCHAR) cold];
>  # endif
> +                          }
>
>                          is_seqval = false;
>                          if (cend == L_('[') && *p == L_('.'))
>                            {
>                              const CHAR *startp = p;
>                              size_t c1 = 0;
>
> @@ -752,14 +758,20 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>                              cend = FOLD (cend);
>                            }
>
> +                        /* If we have rules, and the low sequence is lower than
> +                           the value of the searched sequence then we must
> +                           lookup the high collation sequence value and
> +                           determine if the fcollseq falls within the range.
> +                           If hcollseq is unknown then we could still match
> +                           fcollseq on the low end of the range.  If lcollseq
> +                           if unknown (0xffffffff) we will still fail to
> +                           match, but in the future we might consider matching
> +                           the high end of the range on an exact match.  */
> +                        if (nrules != 0 && (
>  # if WIDE_CHAR_VERSION
>                              lcollseq == 0xffffffff ||

This should use the same constant as the initialization.  The #if is now
unnecessary and can be removed.

>  # endif
> +                            lcollseq <= fcollseq))
>                            {
>                              /* We have to look at the upper bound.  */
>                              uint32_t hcollseq;
> @@ -789,6 +801,17 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>                              if (lcollseq <= hcollseq && fcollseq <= hcollseq)
>                                goto matched;
>                            }
> +
> +                        /* No rules, but we have a range.  */
> +                        if (nrules == 0)
> +                          {
> +                            if (cend == L_('\0'))
> +                              return FNM_NOMATCH;
> +
> +                            /* Compare that fn is within the range.  */
> +                            if ((UCHAR) cold <= fn && fn <= cend)
> +                              goto matched;
> +

This part looks okay to me.
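
To illustrate the two remarks about the ~((uint32_t) 0) sentinel, here is a
minimal sketch, not the actual glibc code.  COLLSEQ_UNKNOWN,
toy_collseq_lookup and toy_range_match are hypothetical names, and the flat
array is only an assumed stand-in for the real table behind
__collseq_table_lookup.  It shows one named constant used for both the
initialization and the comparison, plus the plain code point fallback for
nrules == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the "unknown collation sequence" sentinel that
   the patch spells as ~((uint32_t) 0) and as 0xffffffff.  */
#define COLLSEQ_UNKNOWN (~((uint32_t) 0))

/* Toy stand-in for __collseq_table_lookup: a flat array indexed by
   code point.  Assumption: the real function uses a different table
   layout; only the sentinel convention is modelled here.  */
static uint32_t
toy_collseq_lookup (const uint32_t *table, size_t size, uint32_t wc)
{
  return wc < size ? table[wc] : COLLSEQ_UNKNOWN;
}

/* Simplified range test.  With nrules == 0 the code points are compared
   directly and no table is touched.  Otherwise all three values are
   looked up, and any unknown value is treated as "no match" (a
   simplification of what the patch does).  */
static bool
toy_range_match (uint32_t nrules, const uint32_t *table, size_t size,
                 uint32_t lo, uint32_t hi, uint32_t fn)
{
  if (nrules == 0)
    return lo <= fn && fn <= hi;

  assert (table != NULL);
  uint32_t fcollseq = toy_collseq_lookup (table, size, fn);
  uint32_t lcollseq = toy_collseq_lookup (table, size, lo);
  uint32_t hcollseq = toy_collseq_lookup (table, size, hi);
  if (fcollseq == COLLSEQ_UNKNOWN || lcollseq == COLLSEQ_UNKNOWN
      || hcollseq == COLLSEQ_UNKNOWN)
    return false;
  return lcollseq <= fcollseq && fcollseq <= hcollseq;
}
```

With a single constant, the wide and non-wide paths can share the same
comparison, which is why the extra #if around the sentinel check is no
longer needed.
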
> diff --git a/posix/regcomp.c b/posix/regcomp.c
> index d93698ae78..f55d20cbfd 100644
> --- a/posix/regcomp.c
> +++ b/posix/regcomp.c
> @@ -2889,7 +2889,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>            if (MB_CUR_MAX == 1)
>            */
>            if (nrules == 0)
> -            return collseqmb[br_elem->opr.ch];
> +            return br_elem->opr.ch;
>            else
>              {
>                wint_t wc = __btowc (br_elem->opr.ch);
> @@ -2900,6 +2900,8 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>          {
>            if (nrules != 0)
>              return __collseq_table_lookup (collseqwc, br_elem->opr.wch);
> +          else
> +            return br_elem->opr.wch;
>          }
>        else if (br_elem->type == COLL_SYM)
>          {
> @@ -2935,7 +2937,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>                  }
>              }
>            else if (sym_name_len == 1)
> -            return collseqmb[br_elem->opr.name[0]];
> +            return br_elem->opr.name[0];
>          }
>        return UINT_MAX;
>      }
> @@ -3017,7 +3019,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>            if (MB_CUR_MAX == 1)
>            */
>            if (nrules == 0)
> -            ch_collseq = collseqmb[ch];
> +            ch_collseq = ch;
>            else
>              ch_collseq = __collseq_table_lookup (collseqwc, __btowc (ch));
>            if (start_collseq <= ch_collseq && ch_collseq <= end_collseq)
> @@ -3103,11 +3105,11 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
>    int token_len;
>    bool first_round = true;
>  #ifdef _LIBC
> -  collseqmb = (const unsigned char *)
> -    _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
>    nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
>    if (nrules)
>      {
> +      collseqmb = (const unsigned char *)
> +        _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
>        /*
>        if (MB_CUR_MAX > 1)
>        */

These changes look good.
> diff --git a/posix/regexec.c b/posix/regexec.c
> index f7b4f9cfc3..6cc23831aa 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3858,62 +3858,53 @@ check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
>  }
>
>  # ifdef _LIBC
> +#include

This really should go to the start of the file.

> +
>  static unsigned int
>  find_collation_sequence_value (const unsigned char *mbs, size_t mbs_len)
>  {
> +  int32_t idx;
> +  const unsigned char *extra = (const unsigned char *)
>      _NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB);
> +  int32_t extrasize = (const unsigned char *)
>      _NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB + 1) - extra;
> +  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
> +
> +  /* Only called from within 'if (nrules != 0)'. */

Trailing whitespace on the last line.  Most of this is just
reindentation, so okay.

> +  assert (nrules != 0);

Right, the single caller is under nrules != 0.

strcoll and strxfrm are already correct.

Thanks,
Florian