From mboxrd@z Thu Jan 1 00:00:00 1970 From: Anthony Ramine Subject: Re: [PATCH v5] wildmatch: properly fold case everywhere Date: Mon, 3 Jun 2013 01:42:51 +0200 Message-ID: <586F64C2-0F44-4DAB-B91A-DB88C624FEC2@gmail.com> References: <1369903506-72731-1-git-send-email-n.oxyde@gmail.com> <1369909150-73114-1-git-send-email-n.oxyde@gmail.com> <7vppw4mb77.fsf@alter.siamese.dyndns.org> Mime-Version: 1.0 (Apple Message framework v1283) Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Git Mailing List , =?utf-8?Q?Nguy=E1=BB=85n_Th=C3=A1i_Ng=E1=BB=8Dc_Duy?= To: Junio C Hamano X-From: git-owner@vger.kernel.org Mon Jun 03 01:43:03 2013 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1UjHvQ-0003KZ-CV for gcvg-git-2@plane.gmane.org; Mon, 03 Jun 2013 01:43:00 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755516Ab3FBXm5 convert rfc822-to-quoted-printable (ORCPT ); Sun, 2 Jun 2013 19:42:57 -0400 Received: from mail-wi0-f175.google.com ([209.85.212.175]:47361 "EHLO mail-wi0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755224Ab3FBXmz convert rfc822-to-8bit (ORCPT ); Sun, 2 Jun 2013 19:42:55 -0400 Received: by mail-wi0-f175.google.com with SMTP id hn14so2165145wib.2 for ; Sun, 02 Jun 2013 16:42:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; bh=tXNn7ALF68he/kMEJpWO4pj0zGIFmGdlNdKQ9rf2EHM=; b=pFVplehqYNvXiW2o223+6FRRUQuLvPPxVEJhVC1cyfngCF/IddccBgGNalnQ/5S0O9 5/ib0Ye5Wt5BsKKr1OlSq6k+Btd1vrMSr/O8tXfD2CDP+vbAffgS10VeUIF7WHkf5yd+ tX2PmJl9MahXaDg1hSTK1LVB3C7grziS+Ffhssl46Z4rhSKk7S1QezD4mPPdTH0/qDhU 3kTJcNRk7ztmVgu8Z6bCiB6Li27d03Bwn/4CXfNZVCgCMs14WK1KTEqpn7suE0I1am5b lkmPaqwgB4UUpswX0UT+vscRk12yMe99c+Gv2m04zTprz07n0eqL9v9uXVIj0xxBSf/z XP6g== X-Received: by 10.180.198.49 with SMTP id iz17mr10363212wic.39.1370216573962; Sun, 02 Jun 2013 16:42:53 -0700 (PDT) Received: from [192.168.1.44] (vol75-16-88-182-187-128.fbx.proxad.net. [88.182.187.128]) by mx.google.com with ESMTPSA id d10sm19868034wik.0.2013.06.02.16.42.52 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sun, 02 Jun 2013 16:42:52 -0700 (PDT) In-Reply-To: <7vppw4mb77.fsf@alter.siamese.dyndns.org> X-Mailer: Apple Mail (2.1283) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Hello Junio, Replied inline. Regards, --=20 Anthony Ramine Le 2 juin 2013 =E0 23:53, Junio C Hamano a =E9crit : > Anthony Ramine writes: >=20 >> ase folding is not done correctly when matching against the [:upper:= ] >> character class and uppercased character ranges (e.g. A-Z). >> Specifically, an uppercase letter fails to match against any of them >> when case folding is requested because plain characters in the patte= rn >> and the whole string are preemptively lowercased to handle the base = case >> fast. >>=20 >> That optimization is kept and ISLOWER() is used in the [:upper:] cas= e >> when case folding is requested, while matching against a character r= ange >> is retried with toupper() if the character was lowercase, as the bou= nds >> of the range itself cannot be modified (in a case-insensitive contex= t, >> [A-_] is not equivalent to [a-_]). >>=20 >> Signed-off-by: Anthony Ramine >> Reviewed-by: Duy Nguyen >> --- >=20 > Thanks. >=20 >> t/t3070-wildmatch.sh | 55 ++++++++++++++++++++++++++++++++++++++++++= ++++------ >> wildmatch.c | 7 +++++++ >> 2 files changed, 56 insertions(+), 6 deletions(-) >>=20 >> I added Duy as reviewer and fixed a typo in the commit message repor= ted by >> Eric Sunshine. >>=20 >> diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh >> index 4c37057..38446a0 100755 >> --- a/t/t3070-wildmatch.sh >> +++ b/t/t3070-wildmatch.sh >> @@ -6,20 +6,20 @@ test_description=3D'wildmatch tests' >>=20 >> match() { >> if [ $1 =3D 1 ]; then >> - test_expect_success "wildmatch: match '$3' '$4'" " >> + test_expect_success "wildmatch: match '$3' '$4'" " >> test-wildmatch wildmatch '$3' '$4' >> " >> else >> - test_expect_success "wildmatch: no match '$3' '$4'" " >> + test_expect_success "wildmatch: no match '$3' '$4'" " >> ! test-wildmatch wildmatch '$3' '$4' >> " >> fi >> if [ $2 =3D 1 ]; then >> - test_expect_success "fnmatch: match '$3' '$4'" " >> + test_expect_success "fnmatch: match '$3' '$4'" " >> test-wildmatch fnmatch '$3' '$4' >> " >> elif [ $2 =3D 0 ]; then >> - test_expect_success "fnmatch: no match '$3' '$4'" " >> + test_expect_success "fnmatch: no match '$3' '$4'" " >> ! test-wildmatch fnmatch '$3' '$4' >> " >> # else >=20 > Is the above about aligning $3/$4 across checks of different types > (i.e. purely cosmetic)? I am not complaining; just making sure if > there is nothing deeper going on. Yes purely cosmetic. > It is outside the scope of this change, but the shell style of this > script (most notably use of [] instead of test) needs to be fixed > someday, preferrably soon, including the commented out else clause > at the end of the hunk. >=20 >> @@ -235,4 +247,35 @@ pathmatch 1 abcXdefXghi '*X*i' >> pathmatch 1 ab/cXd/efXg/hi '*/*X*/*/*i' >> pathmatch 1 ab/cXd/efXg/hi '*Xg*i' >>=20 >> +# Case-sensitivy features >> +match 0 x 'a' '[A-Z]' >> +match 1 x 'A' '[A-Z]' >> +match 0 x 'A' '[a-z]' >> +match 1 x 'a' '[a-z]' >> +match 0 x 'a' '[[:upper:]]' >> +match 1 x 'A' '[[:upper:]]' >> +match 0 x 'A' '[[:lower:]]' >> +match 1 x 'a' '[[:lower:]]' >> +match 0 x 'A' '[B-Za]' >> +match 1 x 'a' '[B-Za]' >> +match 0 x 'A' '[B-a]' >> +match 1 x 'a' '[B-a]' >> +match 0 x 'z' '[Z-y]' >> +match 1 x 'Z' '[Z-y]' >> + >> +imatch 1 'a' '[A-Z]' >=20 > Do we want "# Case-insensitivity features" commment here as well? I don't particularly care, some sections have titles, some don't. >> +imatch 1 'A' '[A-Z]' >> +imatch 1 'A' '[a-z]' >> +imatch 1 'a' '[a-z]' >> +imatch 1 'a' '[[:upper:]]' >> +imatch 1 'A' '[[:upper:]]' >> +imatch 1 'A' '[[:lower:]]' >> +imatch 1 'a' '[[:lower:]]' >> +imatch 1 'A' '[B-Za]' >> +imatch 1 'a' '[B-Za]' >> +imatch 1 'A' '[B-a]' >> +imatch 1 'a' '[B-a]' >> +imatch 1 'z' '[Z-y]' >> +imatch 1 'Z' '[Z-y]' >=20 >> + >> test_done >> diff --git a/wildmatch.c b/wildmatch.c >> index 7192bdc..f91ba99 100644 >> --- a/wildmatch.c >> +++ b/wildmatch.c >> @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *= text, unsigned int flags) >> } >> if (t_ch <=3D p_ch && t_ch >=3D prev_ch) >> matched =3D 1; >> + else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) { >> + uchar t_ch_upper =3D toupper(t_ch); >> + if (t_ch_upper <=3D p_ch && t_ch_upper >=3D prev_ch) >> + matched =3D 1; >> + } >> p_ch =3D 0; /* This makes "prev_ch" get set to 0. */ >=20 > Hmm, this looks somewhat strange. >=20 > * At the beginning of the outermost "per characters in the text" > loop, we seem to downcase t_ch when WM_CASEFOLD is set. > * Also at the same place, we also seem to downcase p_ch under the > same condition. >=20 > which makes me wonder why the fix is not like this: >=20 > + if (flags & WM_CASEFOLD) { > + if (ISUPPER(p_ch)) > + p_ch =3D tolower(p_ch); > + if (prev_ch && ISUPPER(prev_ch)) > + prev_ch =3D tolower(prev_ch); > + } > if (t_ch <=3D p_ch && t_ch >=3D prev_ch) > matched =3D 1; > p_ch =3D 0; /* This sets "prev_ch" to 0 */ >=20 >=20 > Ahh, OK, the "seemingly strange" construct is about handling a range > like "[Z-y]"; we do not want to upcase or downcase the p_ch/prev_ch > make the range "[z-y]" (empty) which would exclude bytes like "^", > "_" or even "Z". >=20 > And it is also OK to downcase p_ch in a single-letter case, not the > characters in a range, at the beginning of the outermost loop, > because we always compare for equality against t_ch (which is > downcased) in that case. Yeah I tried to explain that in the commit message but it is surely too= dense. >> @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *t= ext, unsigned int flags) >> } else if (CC_EQ(s,i, "upper")) { >> if (ISUPPER(t_ch)) >> matched =3D 1; >> + else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) >> + matched =3D 1; >=20 > This also looks somewhat strange but correct in that t_ch is already > downcased so we do not need a corresponding change for CC_EQ("lower") > codepath. I chose to still lowercase everything first, to keep the common path as= is. > Interesting. Will apply. >=20 > Thanks. Great. You're welcome.