bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: Paul Eggert <eggert@cs.ucla.edu>, bug-gnulib@gnu.org
Subject: Re: [PATCH 07/10] regex: fix longstanding backref match bug
Date: Tue, 9 Feb 2021 16:42:21 -0300	[thread overview]
Message-ID: <041bd09f-447e-40ca-70a6-1eab3993fe19@linaro.org> (raw)
In-Reply-To: <20210206012602.2257711-7-eggert@cs.ucla.edu>

Hi Paul,

Trying to sync gnulib with glibc code, this patch trigger some regression
on glibc testcases:

FAIL: posix/tst-boost
FAIL: posix/tst-pcre
FAIL: posix/tst-rxspencer
FAIL: posix/tst-rxspencer-no-utf8

$ grep -n "FAIL rm" posix/tst-boost.out
445:FAIL rm[1] 3..-1 != expected 2..3
448:FAIL rm[1] 3..-1 != expected 2..3
451:FAIL rm[1] 3..-1 != expected 2..3
454:FAIL rm[1] 3..-1 != expected 2..3

$ cat posix/tst-pcre.out 
1168: /([a]*)*/ on a unexpectedly failed to match register 1 1..-1
1171: /([a]*)*/ on aaaaa  unexpectedly failed to match register 1 5..-1
1176: /([ab]*)*/ on a unexpectedly failed to match register 1 1..-1
1179: /([ab]*)*/ on b unexpectedly failed to match register 1 1..-1
1182: /([ab]*)*/ on ababab unexpectedly failed to match register 1 6..-1
1185: /([ab]*)*/ on aaaabcde unexpectedly failed to match register 1 5..-1
1188: /([ab]*)*/ on bbbb     unexpectedly failed to match register 1 4..-1
1193: /([^a]*)*/ on b unexpectedly failed to match register 1 1..-1
1196: /([^a]*)*/ on bbbb unexpectedly failed to match register 1 4..-1
1203: /([^ab]*)*/ on cccc unexpectedly failed to match register 1 4..-1

$ cat posix/tst-rxspencer.out  | grep FAIL
FAIL rm[1] unexpectedly did not match
FAIL rm[1] unexpectedly did not match

$ cat posix/tst-rxspencer-no-utf8.out | grep FAIL
FAIL rm[1] unexpectedly did not match
FAIL rm[1] unexpectedly did not match

On 05/02/2021 22:25, Paul Eggert wrote:
> This fixes a longstanding glibc bug concerning backreferences
> <https://sourceware.org/11053> (2009-12-04).
> * lib/regexec.c (proceed_next_node, push_fail_stack)
> (pop_fail_stack): Push and pop the previous registers
> as well as the current ones.  All callers changed.
> (set_regs): Also pop if CUR_NODE has already been checked,
> so that it does not get added as a duplicate set entry.
> (update_regs): Fix comment location.
> * tests/test-regex.c (tests): New constant.
> (bug_regex11): New test function.
> (main): Bump alarm value.  Call new test function.
> ---
>  ChangeLog          |  13 ++++++
>  lib/regexec.c      |  26 +++++++----
>  tests/test-regex.c | 113 ++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/ChangeLog b/ChangeLog
> index 74304474b..bd7d1fa16 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,5 +1,18 @@
>  2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
>  
> +	regex: fix longstanding backref match bug
> +	This fixes a longstanding glibc bug concerning backreferences
> +	<https://sourceware.org/11053> (2009-12-04).
> +	* lib/regexec.c (proceed_next_node, push_fail_stack)
> +	(pop_fail_stack): Push and pop the previous registers
> +	as well as the current ones.  All callers changed.
> +	(set_regs): Also pop if CUR_NODE has already been checked,
> +	so that it does not get added as a duplicate set entry.
> +	(update_regs): Fix comment location.
> +	* tests/test-regex.c (tests): New constant.
> +	(bug_regex11): New test function.
> +	(main): Bump alarm value.  Call new test function.
> +
>  	regex: avoid duplicate in espilon closure
>  	* lib/regcomp.c (calc_eclosure_iter): Insert NODE into epsilon
>  	closure first rather than last.  Otherwise, the epsilon closure
> diff --git a/lib/regexec.c b/lib/regexec.c
> index fdd2e373e..424bc8d15 100644
> --- a/lib/regexec.c
> +++ b/lib/regexec.c
> @@ -59,7 +59,7 @@ static void update_regs (const re_dfa_t *dfa, regmatch_t *pmatch,
>  			 Idx cur_idx, Idx nmatch);
>  static reg_errcode_t push_fail_stack (struct re_fail_stack_t *fs,
>  				      Idx str_idx, Idx dest_node, Idx nregs,
> -				      regmatch_t *regs,
> +				      regmatch_t *regs, regmatch_t *prevregs,
>  				      re_node_set *eps_via_nodes);
>  static reg_errcode_t set_regs (const regex_t *preg,
>  			       const re_match_context_t *mctx,
> @@ -1211,6 +1211,7 @@ check_halt_state_context (const re_match_context_t *mctx,
>  
>  static Idx
>  proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
> +		   regmatch_t *prevregs,
>  		   Idx *pidx, Idx node, re_node_set *eps_via_nodes,
>  		   struct re_fail_stack_t *fs)
>  {
> @@ -1243,7 +1244,7 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
>  	      /* Otherwise, push the second epsilon-transition on the fail stack.  */
>  	      else if (fs != NULL
>  		       && push_fail_stack (fs, *pidx, candidate, nregs, regs,
> -					   eps_via_nodes))
> +					   prevregs, eps_via_nodes))
>  		return -2;
>  
>  	      /* We know we are going to exit.  */
> @@ -1316,7 +1317,8 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
>  static reg_errcode_t
>  __attribute_warn_unused_result__
>  push_fail_stack (struct re_fail_stack_t *fs, Idx str_idx, Idx dest_node,
> -		 Idx nregs, regmatch_t *regs, re_node_set *eps_via_nodes)
> +		 Idx nregs, regmatch_t *regs, regmatch_t *prevregs,
> +		 re_node_set *eps_via_nodes)
>  {
>    reg_errcode_t err;
>    Idx num = fs->num++;
> @@ -1332,23 +1334,26 @@ push_fail_stack (struct re_fail_stack_t *fs, Idx str_idx, Idx dest_node,
>      }
>    fs->stack[num].idx = str_idx;
>    fs->stack[num].node = dest_node;
> -  fs->stack[num].regs = re_malloc (regmatch_t, nregs);
> +  fs->stack[num].regs = re_malloc (regmatch_t, 2 * nregs);
>    if (fs->stack[num].regs == NULL)
>      return REG_ESPACE;
>    memcpy (fs->stack[num].regs, regs, sizeof (regmatch_t) * nregs);
> +  memcpy (fs->stack[num].regs + nregs, prevregs, sizeof (regmatch_t) * nregs);
>    err = re_node_set_init_copy (&fs->stack[num].eps_via_nodes, eps_via_nodes);
>    return err;
>  }
>  
>  static Idx
>  pop_fail_stack (struct re_fail_stack_t *fs, Idx *pidx, Idx nregs,
> -		regmatch_t *regs, re_node_set *eps_via_nodes)
> +		regmatch_t *regs, regmatch_t *prevregs,
> +		re_node_set *eps_via_nodes)
>  {
>    if (fs == NULL || fs->num == 0)
>      return -1;
>    Idx num = --fs->num;
>    *pidx = fs->stack[num].idx;
>    memcpy (regs, fs->stack[num].regs, sizeof (regmatch_t) * nregs);
> +  memcpy (prevregs, fs->stack[num].regs + nregs, sizeof (regmatch_t) * nregs);
>    re_node_set_free (eps_via_nodes);
>    re_free (fs->stack[num].regs);
>    *eps_via_nodes = fs->stack[num].eps_via_nodes;
> @@ -1408,7 +1413,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>      {
>        update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
>  
> -      if (idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
> +      if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
> +	  || re_node_set_contains (&eps_via_nodes, cur_node))
>  	{
>  	  Idx reg_idx;
>  	  cur_node = -1;
> @@ -1418,7 +1424,7 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>  		if (pmatch[reg_idx].rm_so > -1 && pmatch[reg_idx].rm_eo == -1)
>  		  {
>  		    cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
> -					       &eps_via_nodes);
> +					       prev_idx_match, &eps_via_nodes);
>  		    break;
>  		  }
>  	    }
> @@ -1431,7 +1437,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>  	}
>  
>        /* Proceed to next node.  */
> -      cur_node = proceed_next_node (mctx, nmatch, pmatch, &idx, cur_node,
> +      cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
> +				    &idx, cur_node,
>  				    &eps_via_nodes, fs);
>  
>        if (__glibc_unlikely (cur_node < 0))
> @@ -1443,7 +1450,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>  	      free_fail_stack_return (fs);
>  	      return REG_ESPACE;
>  	    }
> -	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch, &eps_via_nodes);
> +	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
> +				     prev_idx_match, &eps_via_nodes);
>  	  if (cur_node < 0)
>  	    {
>  	      re_node_set_free (&eps_via_nodes);
> diff --git a/tests/test-regex.c b/tests/test-regex.c
> index 58f8200f2..a14619805 100644
> --- a/tests/test-regex.c
> +++ b/tests/test-regex.c
> @@ -55,6 +55,111 @@ really_utf8 (void)
>    return strcmp (locale_charset (), "UTF-8") == 0;
>  }
>  
> +/* Tests supposed to match; copied from glibc posix/bug-regex11.c.  */
> +static struct
> +{
> +  const char *pattern;
> +  const char *string;
> +  int flags, nmatch;
> +  regmatch_t rm[5];
> +} const tests[] = {
> +  /* Test for newline handling in regex.  */
> +  { "[^~]*~", "\nx~y", 0, 2, { { 0, 3 }, { -1, -1 } } },
> +  /* Other tests.  */
> +  { "a(.*)b", "a b", REG_EXTENDED, 2, { { 0, 3 }, { 1, 2 } } },
> +  { ".*|\\([KIO]\\)\\([^|]*\\).*|?[KIO]", "10~.~|P|K0|I10|O16|?KSb", 0, 3,
> +    { { 0, 21 }, { 15, 16 }, { 16, 18 } } },
> +  { ".*|\\([KIO]\\)\\([^|]*\\).*|?\\1", "10~.~|P|K0|I10|O16|?KSb", 0, 3,
> +    { { 0, 21 }, { 8, 9 }, { 9, 10 } } },
> +  { "^\\(a*\\)\\1\\{9\\}\\(a\\{0,9\\}\\)\\([0-9]*;.*[^a]\\2\\([0-9]\\)\\)",
> +    "a1;;0a1aa2aaa3aaaa4aaaaa5aaaaaa6aaaaaaa7aaaaaaaa8aaaaaaaaa9aa2aa1a0", 0,
> +    5, { { 0, 67 }, { 0, 0 }, { 0, 1 }, { 1, 67 }, { 66, 67 } } },
> +  /* Test for BRE expression anchoring.  POSIX says just that this may match;
> +     in glibc regex it always matched, so avoid changing it.  */
> +  { "\\(^\\|foo\\)bar", "bar", 0, 2, { { 0, 3 }, { -1, -1 } } },
> +  { "\\(foo\\|^\\)bar", "bar", 0, 2, { { 0, 3 }, { -1, -1 } } },
> +  /* In ERE this must be treated as an anchor.  */
> +  { "(^|foo)bar", "bar", REG_EXTENDED, 2, { { 0, 3 }, { -1, -1 } } },
> +  { "(foo|^)bar", "bar", REG_EXTENDED, 2, { { 0, 3 }, { -1, -1 } } },
> +  /* Here ^ cannot be treated as an anchor according to POSIX.  */
> +  { "(^|foo)bar", "(^|foo)bar", 0, 2, { { 0, 10 }, { -1, -1 } } },
> +  { "(foo|^)bar", "(foo|^)bar", 0, 2, { { 0, 10 }, { -1, -1 } } },
> +  /* More tests on backreferences.  */
> +  { "()\\1", "x", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } },
> +  { "()x\\1", "x", REG_EXTENDED, 2, { { 0, 1 }, { 0, 0 } } },
> +  { "()\\1*\\1*", "", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } },
> +  { "([0-9]).*\\1(a*)", "7;7a6", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 3, 4 } } },
> +  { "([0-9]).*\\1(a*)", "7;7a", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 3, 4 } } },
> +  { "(b)()c\\1", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 1 }, { 1, 1 } } },
> +  { "()(b)c\\2", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 0 }, { 0, 1 } } },
> +  { "a(b)()c\\1", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 2 }, { 2, 2 } } },
> +  { "a()(b)c\\2", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 1 }, { 1, 2 } } },
> +  { "()(b)\\1c\\2", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 0 }, { 0, 1 } } },
> +  { "(b())\\2\\1", "bbbb", REG_EXTENDED, 3, { { 0, 2 }, { 0, 1 }, { 1, 1 } } },
> +  { "a()(b)\\1c\\2", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 1 }, { 1, 2 } } },
> +  { "a()d(b)\\1c\\2", "adbcb", REG_EXTENDED, 3, { { 0, 5 }, { 1, 1 }, { 2, 3 } } },
> +  { "a(b())\\2\\1", "abbbb", REG_EXTENDED, 3, { { 0, 3 }, { 1, 2 }, { 2, 2 } } },
> +  { "(bb())\\2\\1", "bbbb", REG_EXTENDED, 3, { { 0, 4 }, { 0, 2 }, { 2, 2 } } },
> +  { "^([^,]*),\\1,\\1$", "a,a,a", REG_EXTENDED, 2, { { 0, 5 }, { 0, 1 } } },
> +  { "^([^,]*),\\1,\\1$", "ab,ab,ab", REG_EXTENDED, 2, { { 0, 8 }, { 0, 2 } } },
> +  { "^([^,]*),\\1,\\1,\\1$", "abc,abc,abc,abc", REG_EXTENDED, 2,
> +    { { 0, 15 }, { 0, 3 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?).?\\5\\4\\3\\2\\1$",
> +    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
> +    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
> +    "abcdedcba", REG_EXTENDED, 1, { { 0, 9 } } },
> +  /* XXX Not used since they fail so far.  */
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
> +    "ababababa", REG_EXTENDED, 1, { { 0, 9 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$",
> +    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$",
> +    "ababababa", REG_EXTENDED, 1, { { 0, 9 } } },
> +};
> +
> +static void
> +bug_regex11 (void)
> +{
> +  regex_t re;
> +  regmatch_t rm[5];
> +  size_t i;
> +  int n;
> +
> +  for (i = 0; i < sizeof (tests) / sizeof (tests[0]); ++i)
> +    {
> +      n = regcomp (&re, tests[i].pattern, tests[i].flags);
> +      if (n != 0)
> +	{
> +	  char buf[500];
> +	  regerror (n, &re, buf, sizeof (buf));
> +	  report_error ("%s: regcomp %zd failed: %s", tests[i].pattern, i, buf);
> +	  continue;
> +	}
> +
> +      if (regexec (&re, tests[i].string, tests[i].nmatch, rm, 0))
> +	{
> +	  report_error ("%s: regexec %zd failed", tests[i].pattern, i);
> +	  regfree (&re);
> +	  continue;
> +	}
> +
> +      for (n = 0; n < tests[i].nmatch; ++n)
> +	if (rm[n].rm_so != tests[i].rm[n].rm_so
> +              || rm[n].rm_eo != tests[i].rm[n].rm_eo)
> +	  {
> +	    if (tests[i].rm[n].rm_so == -1 && tests[i].rm[n].rm_eo == -1)
> +	      break;
> +	    report_error ("%s: regexec %zd match failure rm[%d] %d..%d",
> +                          tests[i].pattern, i, n, rm[n].rm_so, rm[n].rm_eo);
> +	    break;
> +	  }
> +
> +      regfree (&re);
> +    }
> +}
> +
>  int
>  main (void)
>  {
> @@ -65,11 +170,15 @@ main (void)
>    struct re_registers regs;
>  
>  #if HAVE_DECL_ALARM
> -  /* Some builds of glibc go into an infinite loop on this test.  */
> -  int alarm_value = 2;
> +  /* In case a bug causes glibc to go into an infinite loop.
> +     The tests should take less than 10 s on a reasonably modern CPU.  */
> +  int alarm_value = 1000;
>    signal (SIGALRM, SIG_DFL);
>    alarm (alarm_value);
>  #endif
> +
> +  bug_regex11 ();
> +
>    if (setlocale (LC_ALL, "en_US.UTF-8"))
>      {
>        {
> 


  reply	other threads:[~2021-02-09 20:23 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
2021-02-06  1:25 ` [PATCH 02/10] regex: avoid undefined behavior Paul Eggert
2021-02-06  1:25 ` [PATCH 03/10] regex: minor refactoring Paul Eggert
2021-02-06  1:25 ` [PATCH 04/10] regex: make it easier to merge into glibc Paul Eggert
2021-02-06  1:25 ` [PATCH 05/10] regex-tests: fix typo Paul Eggert
2021-02-06  1:25 ` [PATCH 06/10] regex: avoid duplicate in espilon closure Paul Eggert
2021-02-06  1:25 ` [PATCH 07/10] regex: fix longstanding backref match bug Paul Eggert
2021-02-09 19:42   ` Adhemerval Zanella [this message]
2021-02-09 22:22     ` Paul Eggert
2021-04-16 21:37   ` Dmitry V. Levin
2021-02-06  1:26 ` [PATCH 08/10] regex: debug check for set member duplicates Paul Eggert
2021-02-06  1:26 ` [PATCH 09/10] regex-tests: add bug 11053 test Paul Eggert
2021-02-06  1:26 ` [PATCH 10/10] regex: fix comment location Paul Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.gnu.org/mailman/listinfo/bug-gnulib

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=041bd09f-447e-40ca-70a6-1eab3993fe19@linaro.org \
    --to=adhemerval.zanella@linaro.org \
    --cc=bug-gnulib@gnu.org \
    --cc=eggert@cs.ucla.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).