bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed
* [PATCH 01/10] regex: improve comments
@ 2021-02-06  1:25 Paul Eggert
  2021-02-06  1:25 ` [PATCH 02/10] regex: avoid undefined behavior Paul Eggert
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regexec.c: Add and correct comments about return values.
---
 ChangeLog     |  5 +++++
 lib/regexec.c | 21 +++++++++++++--------
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 717e51b16..fef04a89c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
+
+	regex: improve comments
+	* lib/regexec.c: Add and correct comments about return values.
+
 2021-01-31  Bruno Haible  <bruno@clisp.org>
 
 	relocatable-prog-wrapper: Tweak today's patch.
diff --git a/lib/regexec.c b/lib/regexec.c
index f7b4f9cfc..fa413dfd5 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -186,7 +186,8 @@ static reg_errcode_t extend_buffers (re_match_context_t *mctx, int min_len);
    REG_NOTBOL is set, then ^ does not match at the beginning of the
    string; if REG_NOTEOL is set, then $ does not match at the end.
 
-   We return 0 if we find a match and REG_NOMATCH if not.  */
+   Return 0 if a match is found, REG_NOMATCH if not, REG_BADPAT if
+   EFLAGS is invalid.  */
 
 int
 regexec (const regex_t *__restrict preg, const char *__restrict string,
@@ -269,8 +270,8 @@ compat_symbol (libc, __compat_regexec, regexec, GLIBC_2_0);
    strings.)
 
    On success, re_match* functions return the length of the match, re_search*
-   return the position of the start of the match.  Return value -1 means no
-   match was found and -2 indicates an internal error.  */
+   return the position of the start of the match.  They return -1 on
+   match failure, -2 on error.  */
 
 regoff_t
 re_match (struct re_pattern_buffer *bufp, const char *string, Idx length,
@@ -1206,7 +1207,7 @@ check_halt_state_context (const re_match_context_t *mctx,
 /* Compute the next node to which "NFA" transit from NODE("NFA" is a NFA
    corresponding to the DFA).
    Return the destination node, and update EPS_VIA_NODES;
-   return -1 in case of errors.  */
+   return -1 on match failure, -2 on error.  */
 
 static Idx
 proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
@@ -2195,6 +2196,7 @@ sift_states_iter_mb (const re_match_context_t *mctx, re_sift_context_t *sctx,
 
 /* Return the next state to which the current state STATE will transit by
    accepting the current input byte, and update STATE_LOG if necessary.
+   Return NULL on failure.
    If STATE can accept a multibyte char/collating element/back reference
    update the destination of STATE_LOG.  */
 
@@ -2395,7 +2397,7 @@ check_subexp_matching_top (re_match_context_t *mctx, re_node_set *cur_nodes,
 
 #if 0
 /* Return the next state to which the current state STATE will transit by
-   accepting the current input byte.  */
+   accepting the current input byte.  Return NULL on failure.  */
 
 static re_dfastate_t *
 transit_state_sb (reg_errcode_t *err, re_match_context_t *mctx,
@@ -2817,7 +2819,8 @@ find_subexp_node (const re_dfa_t *dfa, const re_node_set *nodes,
 /* Check whether the node TOP_NODE at TOP_STR can arrive to the node
    LAST_NODE at LAST_STR.  We record the path onto PATH since it will be
    heavily reused.
-   Return REG_NOERROR if it can arrive, or REG_NOMATCH otherwise.  */
+   Return REG_NOERROR if it can arrive, REG_NOMATCH if it cannot,
+   REG_ESPACE if memory is exhausted.  */
 
 static reg_errcode_t
 __attribute_warn_unused_result__
@@ -3433,7 +3436,8 @@ build_trtable (const re_dfa_t *dfa, re_dfastate_t *state)
 /* Group all nodes belonging to STATE into several destinations.
    Then for all destinations, set the nodes belonging to the destination
    to DESTS_NODE[i] and set the characters accepted by the destination
-   to DEST_CH[i].  This function return the number of destinations.  */
+   to DEST_CH[i].  Return the number of destinations if successful,
+   -1 on internal error.  */
 
 static Idx
 group_nodes_into_DFAstates (const re_dfa_t *dfa, const re_dfastate_t *state,
@@ -4211,7 +4215,8 @@ match_ctx_add_subtop (re_match_context_t *mctx, Idx node, Idx str_idx)
 }
 
 /* Register the node NODE, whose type is OP_CLOSE_SUBEXP, and which matches
-   at STR_IDX, whose corresponding OP_OPEN_SUBEXP is SUB_TOP.  */
+   at STR_IDX, whose corresponding OP_OPEN_SUBEXP is SUB_TOP.
+   Return the new entry if successful, NULL if memory is exhausted.  */
 
 static re_sub_match_last_t *
 match_ctx_add_sublast (re_sub_match_top_t *subtop, Idx node, Idx str_idx)
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 02/10] regex: avoid undefined behavior
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
@ 2021-02-06  1:25 ` Paul Eggert
  2021-02-06  1:25 ` [PATCH 03/10] regex: minor refactoring Paul Eggert
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regexec.c (pop_fail_stack): If the stack is empty, return -1
instead of indulging in undefined behavior.  This simplifies
callers, and avoids undefined behavior in some cases (see glibc
bug 11053, though this change does not fix that overall bug).
---
 ChangeLog     |  6 ++++++
 lib/regexec.c | 30 ++++++++++++++----------------
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index fef04a89c..6fcd5819f 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,11 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: avoid undefined behavior
+	* lib/regexec.c (pop_fail_stack): If the stack is empty, return -1
+	instead of indulging in undefined behavior.  This simplifies
+	callers, and avoids undefined behavior in some cases (see glibc
+	bug 11053, though this change does not fix that overall bug).
+
 	regex: improve comments
 	* lib/regexec.c: Add and correct comments about return values.
 
diff --git a/lib/regexec.c b/lib/regexec.c
index fa413dfd5..f982e3aba 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -1346,13 +1346,15 @@ static Idx
 pop_fail_stack (struct re_fail_stack_t *fs, Idx *pidx, Idx nregs,
 		regmatch_t *regs, re_node_set *eps_via_nodes)
 {
+  if (fs == NULL || fs->num == 0)
+    return -1;
   Idx num = --fs->num;
-  DEBUG_ASSERT (num >= 0);
   *pidx = fs->stack[num].idx;
   memcpy (regs, fs->stack[num].regs, sizeof (regmatch_t) * nregs);
   re_node_set_free (eps_via_nodes);
   re_free (fs->stack[num].regs);
   *eps_via_nodes = fs->stack[num].eps_via_nodes;
+  DEBUG_ASSERT (0 <= fs->stack[num].node);
   return fs->stack[num].node;
 }
 
@@ -1411,25 +1413,22 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
       if (idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
 	{
 	  Idx reg_idx;
+	  cur_node = -1;
 	  if (fs)
 	    {
 	      for (reg_idx = 0; reg_idx < nmatch; ++reg_idx)
 		if (pmatch[reg_idx].rm_so > -1 && pmatch[reg_idx].rm_eo == -1)
-		  break;
-	      if (reg_idx == nmatch)
-		{
-		  re_node_set_free (&eps_via_nodes);
-		  regmatch_list_free (&prev_match);
-		  return free_fail_stack_return (fs);
-		}
-	      cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
-					 &eps_via_nodes);
+		  {
+		    cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
+					       &eps_via_nodes);
+		    break;
+		  }
 	    }
-	  else
+	  if (cur_node < 0)
 	    {
 	      re_node_set_free (&eps_via_nodes);
 	      regmatch_list_free (&prev_match);
-	      return REG_NOERROR;
+	      return free_fail_stack_return (fs);
 	    }
 	}
 
@@ -1446,13 +1445,12 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
 	      free_fail_stack_return (fs);
 	      return REG_ESPACE;
 	    }
-	  if (fs)
-	    cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
-				       &eps_via_nodes);
-	  else
+	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch, &eps_via_nodes);
+	  if (cur_node < 0)
 	    {
 	      re_node_set_free (&eps_via_nodes);
 	      regmatch_list_free (&prev_match);
+	      free_fail_stack_return (fs);
 	      return REG_NOMATCH;
 	    }
 	}
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 03/10] regex: minor refactoring
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
  2021-02-06  1:25 ` [PATCH 02/10] regex: avoid undefined behavior Paul Eggert
@ 2021-02-06  1:25 ` Paul Eggert
  2021-02-06  1:25 ` [PATCH 04/10] regex: make it easier to merge into glibc Paul Eggert
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regexec.c (proceed_next_node): Use more-local decls.
---
 ChangeLog     |  3 +++
 lib/regexec.c | 14 ++++++--------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 6fcd5819f..fdc107673 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: minor refactoring
+	* lib/regexec.c (proceed_next_node): Use more-local decls.
+
 	regex: avoid undefined behavior
 	* lib/regexec.c (pop_fail_stack): If the stack is empty, return -1
 	instead of indulging in undefined behavior.  This simplifies
diff --git a/lib/regexec.c b/lib/regexec.c
index f982e3aba..fdd2e373e 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -1215,19 +1215,17 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
 		   struct re_fail_stack_t *fs)
 {
   const re_dfa_t *const dfa = mctx->dfa;
-  Idx i;
-  bool ok;
   if (IS_EPSILON_NODE (dfa->nodes[node].type))
     {
       re_node_set *cur_nodes = &mctx->state_log[*pidx]->nodes;
       re_node_set *edests = &dfa->edests[node];
-      Idx dest_node;
-      ok = re_node_set_insert (eps_via_nodes, node);
+      bool ok = re_node_set_insert (eps_via_nodes, node);
       if (__glibc_unlikely (! ok))
 	return -2;
-      /* Pick up a valid destination, or return -1 if none
-	 is found.  */
-      for (dest_node = -1, i = 0; i < edests->nelem; ++i)
+
+      /* Pick a valid destination, or return -1 if none is found.  */
+      Idx dest_node = -1;
+      for (Idx i = 0; i < edests->nelem; i++)
 	{
 	  Idx candidate = edests->elems[i];
 	  if (!re_node_set_contains (cur_nodes, candidate))
@@ -1289,7 +1287,7 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
 	  if (naccepted == 0)
 	    {
 	      Idx dest_node;
-	      ok = re_node_set_insert (eps_via_nodes, node);
+	      bool ok = re_node_set_insert (eps_via_nodes, node);
 	      if (__glibc_unlikely (! ok))
 		return -2;
 	      dest_node = dfa->edests[node].elems[0];
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 04/10] regex: make it easier to merge into glibc
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
  2021-02-06  1:25 ` [PATCH 02/10] regex: avoid undefined behavior Paul Eggert
  2021-02-06  1:25 ` [PATCH 03/10] regex: minor refactoring Paul Eggert
@ 2021-02-06  1:25 ` Paul Eggert
  2021-02-06  1:25 ` [PATCH 05/10] regex-tests: fix typo Paul Eggert
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regex_internal.h [_LIBC]: Do not include Gnulib’s dynarray.h.
---
 ChangeLog            | 3 +++
 lib/regex_internal.h | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/ChangeLog b/ChangeLog
index fdc107673..82aa61a04 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: make it easier to merge into glibc
+	* lib/regex_internal.h [_LIBC]: Do not include Gnulib’s dynarray.h.
+
 	regex: minor refactoring
 	* lib/regexec.c (proceed_next_node): Use more-local decls.
 
diff --git a/lib/regex_internal.h b/lib/regex_internal.h
index 3fa2bf1aa..4b0a3efb6 100644
--- a/lib/regex_internal.h
+++ b/lib/regex_internal.h
@@ -32,7 +32,10 @@
 #include <stdbool.h>
 #include <stdint.h>
 
-#include <dynarray.h>
+#ifndef _LIBC
+# include <dynarray.h>
+#endif
+
 #include <intprops.h>
 #include <verify.h>
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 05/10] regex-tests: fix typo
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
                   ` (2 preceding siblings ...)
  2021-02-06  1:25 ` [PATCH 04/10] regex: make it easier to merge into glibc Paul Eggert
@ 2021-02-06  1:25 ` Paul Eggert
  2021-02-06  1:25 ` [PATCH 06/10] regex: avoid duplicate in espilon closure Paul Eggert
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* tests/test-regex.c (main): Fix typo that would have caused an
old test case to report incorrect values on failure.
---
 ChangeLog          | 4 ++++
 tests/test-regex.c | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/ChangeLog b/ChangeLog
index 82aa61a04..f4583e0af 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex-tests: fix typo
+	* tests/test-regex.c (main): Fix typo that would have caused an
+	old test case to report incorrect values on failure.
+
 	regex: make it easier to merge into glibc
 	* lib/regex_internal.h [_LIBC]: Do not include Gnulib’s dynarray.h.
 
diff --git a/tests/test-regex.c b/tests/test-regex.c
index 4a4ff39fd..58f8200f2 100644
--- a/tests/test-regex.c
+++ b/tests/test-regex.c
@@ -290,7 +290,8 @@ main (void)
     {
       memset (&regs, 0, sizeof regs);
       static char const data[] = "WXY";
-      if (re_search (&regex, data, sizeof data - 1, 0, 3, &regs) < 0)
+      i = re_search (&regex, data, sizeof data - 1, 0, 3, &regs);
+      if (i < 0)
         report_error ("re_search '%s' on '%s' returned %d", pat_x, data, i);
       regfree (&regex);
       free (regs.start);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 06/10] regex: avoid duplicate in espilon closure
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
                   ` (3 preceding siblings ...)
  2021-02-06  1:25 ` [PATCH 05/10] regex-tests: fix typo Paul Eggert
@ 2021-02-06  1:25 ` Paul Eggert
  2021-02-06  1:25 ` [PATCH 07/10] regex: fix longstanding backref match bug Paul Eggert
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regcomp.c (calc_eclosure_iter): Insert NODE into epsilon
closure first rather than last.  Otherwise, the epsilon closure
might contain a duplicate of NODE.
---
 ChangeLog     | 5 +++++
 lib/regcomp.c | 8 +++-----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index f4583e0af..74304474b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,10 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: avoid duplicate in espilon closure
+	* lib/regcomp.c (calc_eclosure_iter): Insert NODE into epsilon
+	closure first rather than last.  Otherwise, the epsilon closure
+	might contain a duplicate of NODE.
+
 	regex-tests: fix typo
 	* tests/test-regex.c (main): Fix typo that would have caused an
 	old test case to report incorrect values on failure.
diff --git a/lib/regcomp.c b/lib/regcomp.c
index d93698ae7..887e5b506 100644
--- a/lib/regcomp.c
+++ b/lib/regcomp.c
@@ -1695,12 +1695,14 @@ calc_eclosure_iter (re_node_set *new_set, re_dfa_t *dfa, Idx node, bool root)
   reg_errcode_t err;
   Idx i;
   re_node_set eclosure;
-  bool ok;
   bool incomplete = false;
   err = re_node_set_alloc (&eclosure, dfa->edests[node].nelem + 1);
   if (__glibc_unlikely (err != REG_NOERROR))
     return err;
 
+  /* An epsilon closure includes itself.  */
+  eclosure.elems[eclosure.nelem++] = node;
+
   /* This indicates that we are calculating this node now.
      We reference this value to avoid infinite loop.  */
   dfa->eclosures[node].nelem = -1;
@@ -1753,10 +1755,6 @@ calc_eclosure_iter (re_node_set *new_set, re_dfa_t *dfa, Idx node, bool root)
 	  }
       }
 
-  /* An epsilon closure includes itself.  */
-  ok = re_node_set_insert (&eclosure, node);
-  if (__glibc_unlikely (! ok))
-    return REG_ESPACE;
   if (incomplete && !root)
     dfa->eclosures[node].nelem = 0;
   else
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 07/10] regex: fix longstanding backref match bug
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
                   ` (4 preceding siblings ...)
  2021-02-06  1:25 ` [PATCH 06/10] regex: avoid duplicate in espilon closure Paul Eggert
@ 2021-02-06  1:25 ` Paul Eggert
  2021-02-09 19:42   ` Adhemerval Zanella
  2021-04-16 21:37   ` Dmitry V. Levin
  2021-02-06  1:26 ` [PATCH 08/10] regex: debug check for set member duplicates Paul Eggert
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:25 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

This fixes a longstanding glibc bug concerning backreferences
<https://sourceware.org/11053> (2009-12-04).
* lib/regexec.c (proceed_next_node, push_fail_stack)
(pop_fail_stack): Push and pop the previous registers
as well as the current ones.  All callers changed.
(set_regs): Also pop if CUR_NODE has already been checked,
so that it does not get added as a duplicate set entry.
(update_regs): Fix comment location.
* tests/test-regex.c (tests): New constant.
(bug_regex11): New test function.
(main): Bump alarm value.  Call new test function.
---
 ChangeLog          |  13 ++++++
 lib/regexec.c      |  26 +++++++----
 tests/test-regex.c | 113 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 141 insertions(+), 11 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 74304474b..bd7d1fa16 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,18 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: fix longstanding backref match bug
+	This fixes a longstanding glibc bug concerning backreferences
+	<https://sourceware.org/11053> (2009-12-04).
+	* lib/regexec.c (proceed_next_node, push_fail_stack)
+	(pop_fail_stack): Push and pop the previous registers
+	as well as the current ones.  All callers changed.
+	(set_regs): Also pop if CUR_NODE has already been checked,
+	so that it does not get added as a duplicate set entry.
+	(update_regs): Fix comment location.
+	* tests/test-regex.c (tests): New constant.
+	(bug_regex11): New test function.
+	(main): Bump alarm value.  Call new test function.
+
 	regex: avoid duplicate in espilon closure
 	* lib/regcomp.c (calc_eclosure_iter): Insert NODE into epsilon
 	closure first rather than last.  Otherwise, the epsilon closure
diff --git a/lib/regexec.c b/lib/regexec.c
index fdd2e373e..424bc8d15 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -59,7 +59,7 @@ static void update_regs (const re_dfa_t *dfa, regmatch_t *pmatch,
 			 Idx cur_idx, Idx nmatch);
 static reg_errcode_t push_fail_stack (struct re_fail_stack_t *fs,
 				      Idx str_idx, Idx dest_node, Idx nregs,
-				      regmatch_t *regs,
+				      regmatch_t *regs, regmatch_t *prevregs,
 				      re_node_set *eps_via_nodes);
 static reg_errcode_t set_regs (const regex_t *preg,
 			       const re_match_context_t *mctx,
@@ -1211,6 +1211,7 @@ check_halt_state_context (const re_match_context_t *mctx,
 
 static Idx
 proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
+		   regmatch_t *prevregs,
 		   Idx *pidx, Idx node, re_node_set *eps_via_nodes,
 		   struct re_fail_stack_t *fs)
 {
@@ -1243,7 +1244,7 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
 	      /* Otherwise, push the second epsilon-transition on the fail stack.  */
 	      else if (fs != NULL
 		       && push_fail_stack (fs, *pidx, candidate, nregs, regs,
-					   eps_via_nodes))
+					   prevregs, eps_via_nodes))
 		return -2;
 
 	      /* We know we are going to exit.  */
@@ -1316,7 +1317,8 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
 static reg_errcode_t
 __attribute_warn_unused_result__
 push_fail_stack (struct re_fail_stack_t *fs, Idx str_idx, Idx dest_node,
-		 Idx nregs, regmatch_t *regs, re_node_set *eps_via_nodes)
+		 Idx nregs, regmatch_t *regs, regmatch_t *prevregs,
+		 re_node_set *eps_via_nodes)
 {
   reg_errcode_t err;
   Idx num = fs->num++;
@@ -1332,23 +1334,26 @@ push_fail_stack (struct re_fail_stack_t *fs, Idx str_idx, Idx dest_node,
     }
   fs->stack[num].idx = str_idx;
   fs->stack[num].node = dest_node;
-  fs->stack[num].regs = re_malloc (regmatch_t, nregs);
+  fs->stack[num].regs = re_malloc (regmatch_t, 2 * nregs);
   if (fs->stack[num].regs == NULL)
     return REG_ESPACE;
   memcpy (fs->stack[num].regs, regs, sizeof (regmatch_t) * nregs);
+  memcpy (fs->stack[num].regs + nregs, prevregs, sizeof (regmatch_t) * nregs);
   err = re_node_set_init_copy (&fs->stack[num].eps_via_nodes, eps_via_nodes);
   return err;
 }
 
 static Idx
 pop_fail_stack (struct re_fail_stack_t *fs, Idx *pidx, Idx nregs,
-		regmatch_t *regs, re_node_set *eps_via_nodes)
+		regmatch_t *regs, regmatch_t *prevregs,
+		re_node_set *eps_via_nodes)
 {
   if (fs == NULL || fs->num == 0)
     return -1;
   Idx num = --fs->num;
   *pidx = fs->stack[num].idx;
   memcpy (regs, fs->stack[num].regs, sizeof (regmatch_t) * nregs);
+  memcpy (prevregs, fs->stack[num].regs + nregs, sizeof (regmatch_t) * nregs);
   re_node_set_free (eps_via_nodes);
   re_free (fs->stack[num].regs);
   *eps_via_nodes = fs->stack[num].eps_via_nodes;
@@ -1408,7 +1413,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
     {
       update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
 
-      if (idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
+      if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
+	  || re_node_set_contains (&eps_via_nodes, cur_node))
 	{
 	  Idx reg_idx;
 	  cur_node = -1;
@@ -1418,7 +1424,7 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
 		if (pmatch[reg_idx].rm_so > -1 && pmatch[reg_idx].rm_eo == -1)
 		  {
 		    cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
-					       &eps_via_nodes);
+					       prev_idx_match, &eps_via_nodes);
 		    break;
 		  }
 	    }
@@ -1431,7 +1437,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
 	}
 
       /* Proceed to next node.  */
-      cur_node = proceed_next_node (mctx, nmatch, pmatch, &idx, cur_node,
+      cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
+				    &idx, cur_node,
 				    &eps_via_nodes, fs);
 
       if (__glibc_unlikely (cur_node < 0))
@@ -1443,7 +1450,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
 	      free_fail_stack_return (fs);
 	      return REG_ESPACE;
 	    }
-	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch, &eps_via_nodes);
+	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
+				     prev_idx_match, &eps_via_nodes);
 	  if (cur_node < 0)
 	    {
 	      re_node_set_free (&eps_via_nodes);
diff --git a/tests/test-regex.c b/tests/test-regex.c
index 58f8200f2..a14619805 100644
--- a/tests/test-regex.c
+++ b/tests/test-regex.c
@@ -55,6 +55,111 @@ really_utf8 (void)
   return strcmp (locale_charset (), "UTF-8") == 0;
 }
 
+/* Tests supposed to match; copied from glibc posix/bug-regex11.c.  */
+static struct
+{
+  const char *pattern;
+  const char *string;
+  int flags, nmatch;
+  regmatch_t rm[5];
+} const tests[] = {
+  /* Test for newline handling in regex.  */
+  { "[^~]*~", "\nx~y", 0, 2, { { 0, 3 }, { -1, -1 } } },
+  /* Other tests.  */
+  { "a(.*)b", "a b", REG_EXTENDED, 2, { { 0, 3 }, { 1, 2 } } },
+  { ".*|\\([KIO]\\)\\([^|]*\\).*|?[KIO]", "10~.~|P|K0|I10|O16|?KSb", 0, 3,
+    { { 0, 21 }, { 15, 16 }, { 16, 18 } } },
+  { ".*|\\([KIO]\\)\\([^|]*\\).*|?\\1", "10~.~|P|K0|I10|O16|?KSb", 0, 3,
+    { { 0, 21 }, { 8, 9 }, { 9, 10 } } },
+  { "^\\(a*\\)\\1\\{9\\}\\(a\\{0,9\\}\\)\\([0-9]*;.*[^a]\\2\\([0-9]\\)\\)",
+    "a1;;0a1aa2aaa3aaaa4aaaaa5aaaaaa6aaaaaaa7aaaaaaaa8aaaaaaaaa9aa2aa1a0", 0,
+    5, { { 0, 67 }, { 0, 0 }, { 0, 1 }, { 1, 67 }, { 66, 67 } } },
+  /* Test for BRE expression anchoring.  POSIX says just that this may match;
+     in glibc regex it always matched, so avoid changing it.  */
+  { "\\(^\\|foo\\)bar", "bar", 0, 2, { { 0, 3 }, { -1, -1 } } },
+  { "\\(foo\\|^\\)bar", "bar", 0, 2, { { 0, 3 }, { -1, -1 } } },
+  /* In ERE this must be treated as an anchor.  */
+  { "(^|foo)bar", "bar", REG_EXTENDED, 2, { { 0, 3 }, { -1, -1 } } },
+  { "(foo|^)bar", "bar", REG_EXTENDED, 2, { { 0, 3 }, { -1, -1 } } },
+  /* Here ^ cannot be treated as an anchor according to POSIX.  */
+  { "(^|foo)bar", "(^|foo)bar", 0, 2, { { 0, 10 }, { -1, -1 } } },
+  { "(foo|^)bar", "(foo|^)bar", 0, 2, { { 0, 10 }, { -1, -1 } } },
+  /* More tests on backreferences.  */
+  { "()\\1", "x", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } },
+  { "()x\\1", "x", REG_EXTENDED, 2, { { 0, 1 }, { 0, 0 } } },
+  { "()\\1*\\1*", "", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } },
+  { "([0-9]).*\\1(a*)", "7;7a6", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 3, 4 } } },
+  { "([0-9]).*\\1(a*)", "7;7a", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 3, 4 } } },
+  { "(b)()c\\1", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 1 }, { 1, 1 } } },
+  { "()(b)c\\2", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 0 }, { 0, 1 } } },
+  { "a(b)()c\\1", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 2 }, { 2, 2 } } },
+  { "a()(b)c\\2", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 1 }, { 1, 2 } } },
+  { "()(b)\\1c\\2", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 0 }, { 0, 1 } } },
+  { "(b())\\2\\1", "bbbb", REG_EXTENDED, 3, { { 0, 2 }, { 0, 1 }, { 1, 1 } } },
+  { "a()(b)\\1c\\2", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 1 }, { 1, 2 } } },
+  { "a()d(b)\\1c\\2", "adbcb", REG_EXTENDED, 3, { { 0, 5 }, { 1, 1 }, { 2, 3 } } },
+  { "a(b())\\2\\1", "abbbb", REG_EXTENDED, 3, { { 0, 3 }, { 1, 2 }, { 2, 2 } } },
+  { "(bb())\\2\\1", "bbbb", REG_EXTENDED, 3, { { 0, 4 }, { 0, 2 }, { 2, 2 } } },
+  { "^([^,]*),\\1,\\1$", "a,a,a", REG_EXTENDED, 2, { { 0, 5 }, { 0, 1 } } },
+  { "^([^,]*),\\1,\\1$", "ab,ab,ab", REG_EXTENDED, 2, { { 0, 8 }, { 0, 2 } } },
+  { "^([^,]*),\\1,\\1,\\1$", "abc,abc,abc,abc", REG_EXTENDED, 2,
+    { { 0, 15 }, { 0, 3 } } },
+  { "^(.?)(.?)(.?)(.?)(.?).?\\5\\4\\3\\2\\1$",
+    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
+  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
+    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
+  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
+    "abcdedcba", REG_EXTENDED, 1, { { 0, 9 } } },
+  /* XXX Not used since they fail so far.  */
+  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
+    "ababababa", REG_EXTENDED, 1, { { 0, 9 } } },
+  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$",
+    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
+  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$",
+    "ababababa", REG_EXTENDED, 1, { { 0, 9 } } },
+};
+
+static void
+bug_regex11 (void)
+{
+  regex_t re;
+  regmatch_t rm[5];
+  size_t i;
+  int n;
+
+  for (i = 0; i < sizeof (tests) / sizeof (tests[0]); ++i)
+    {
+      n = regcomp (&re, tests[i].pattern, tests[i].flags);
+      if (n != 0)
+	{
+	  char buf[500];
+	  regerror (n, &re, buf, sizeof (buf));
+	  report_error ("%s: regcomp %zd failed: %s", tests[i].pattern, i, buf);
+	  continue;
+	}
+
+      if (regexec (&re, tests[i].string, tests[i].nmatch, rm, 0))
+	{
+	  report_error ("%s: regexec %zd failed", tests[i].pattern, i);
+	  regfree (&re);
+	  continue;
+	}
+
+      for (n = 0; n < tests[i].nmatch; ++n)
+	if (rm[n].rm_so != tests[i].rm[n].rm_so
+              || rm[n].rm_eo != tests[i].rm[n].rm_eo)
+	  {
+	    if (tests[i].rm[n].rm_so == -1 && tests[i].rm[n].rm_eo == -1)
+	      break;
+	    report_error ("%s: regexec %zd match failure rm[%d] %d..%d",
+                          tests[i].pattern, i, n, rm[n].rm_so, rm[n].rm_eo);
+	    break;
+	  }
+
+      regfree (&re);
+    }
+}
+
 int
 main (void)
 {
@@ -65,11 +170,15 @@ main (void)
   struct re_registers regs;
 
 #if HAVE_DECL_ALARM
-  /* Some builds of glibc go into an infinite loop on this test.  */
-  int alarm_value = 2;
+  /* In case a bug causes glibc to go into an infinite loop.
+     The tests should take less than 10 s on a reasonably modern CPU.  */
+  int alarm_value = 1000;
   signal (SIGALRM, SIG_DFL);
   alarm (alarm_value);
 #endif
+
+  bug_regex11 ();
+
   if (setlocale (LC_ALL, "en_US.UTF-8"))
     {
       {
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 08/10] regex: debug check for set member duplicates
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
                   ` (5 preceding siblings ...)
  2021-02-06  1:25 ` [PATCH 07/10] regex: fix longstanding backref match bug Paul Eggert
@ 2021-02-06  1:26 ` Paul Eggert
  2021-02-06  1:26 ` [PATCH 09/10] regex-tests: add bug 11053 test Paul Eggert
  2021-02-06  1:26 ` [PATCH 10/10] regex: fix comment location Paul Eggert
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:26 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regex_internal.c (re_node_set_insert): Add a DEBUG_ASSERT
that would have caught some recently-fixed performance bugs
that caused sets to contain duplicate members.
---
 ChangeLog            | 5 +++++
 lib/regex_internal.c | 1 +
 2 files changed, 6 insertions(+)

diff --git a/ChangeLog b/ChangeLog
index bd7d1fa16..69abc35dc 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,10 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: debug check for set member duplicates
+	* lib/regex_internal.c (re_node_set_insert): Add a DEBUG_ASSERT
+	that would have caught some recently-fixed performance bugs
+	that caused sets to contain duplicate members.
+
 	regex: fix longstanding backref match bug
 	This fixes a longstanding glibc bug concerning backreferences
 	<https://sourceware.org/11053> (2009-12-04).
diff --git a/lib/regex_internal.c b/lib/regex_internal.c
index 9dd387ef8..55f6b66de 100644
--- a/lib/regex_internal.c
+++ b/lib/regex_internal.c
@@ -1314,6 +1314,7 @@ re_node_set_insert (re_node_set *set, Idx elem)
     {
       for (idx = set->nelem; set->elems[idx - 1] > elem; idx--)
 	set->elems[idx] = set->elems[idx - 1];
+      DEBUG_ASSERT (set->elems[idx - 1] < elem);
     }
 
   /* Insert the new element.  */
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 09/10] regex-tests: add bug 11053 test
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
                   ` (6 preceding siblings ...)
  2021-02-06  1:26 ` [PATCH 08/10] regex: debug check for set member duplicates Paul Eggert
@ 2021-02-06  1:26 ` Paul Eggert
  2021-02-06  1:26 ` [PATCH 10/10] regex: fix comment location Paul Eggert
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:26 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* tests/test-regex.c (main): New test case for glibc bug 11053.
---
 ChangeLog          |  3 +++
 tests/test-regex.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/ChangeLog b/ChangeLog
index 69abc35dc..d838ad0d0 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex-tests: add bug 11053 test
+	* tests/test-regex.c (main): New test case for glibc bug 11053.
+
 	regex: debug check for set member duplicates
 	* lib/regex_internal.c (re_node_set_insert): Add a DEBUG_ASSERT
 	that would have caught some recently-fixed performance bugs
diff --git a/tests/test-regex.c b/tests/test-regex.c
index a14619805..adccf2187 100644
--- a/tests/test-regex.c
+++ b/tests/test-regex.c
@@ -407,6 +407,35 @@ main (void)
       free (regs.end);
     }
 
+  /* glibc bug 11053.  */
+  re_set_syntax (RE_SYNTAX_POSIX_BASIC);
+  memset (&regex, 0, sizeof regex);
+  static char const pat_sub2[] = "\\(a*\\)*a*\\1";
+  s = re_compile_pattern (pat_sub2, sizeof pat_sub2 - 1, &regex);
+  if (s)
+    report_error ("%s: %s", pat_sub2, s);
+  else
+    {
+      memset (&regs, 0, sizeof regs);
+      static char const data[] = "a";
+      int datalen = sizeof data - 1;
+      i = re_search (&regex, data, datalen, 0, datalen, &regs);
+      if (i != 0)
+        report_error ("re_search '%s' on '%s' returned %d", pat_sub2, data, i);
+      else if (regs.num_regs < 2)
+        report_error ("re_search '%s' on '%s' returned only %d registers",
+                      pat_sub2, data, (int) regs.num_regs);
+      else if (! (regs.start[0] == 0 && regs.end[0] == 1))
+        report_error ("re_search '%s' on '%s' returned wrong match [%d,%d)",
+                      pat_sub2, data, (int) regs.start[0], (int) regs.end[0]);
+      else if (! (regs.start[1] == 0 && regs.end[1] == 0))
+        report_error ("re_search '%s' on '%s' returned wrong submatch [%d,%d)",
+                      pat_sub2, data, regs.start[1], regs.end[1]);
+      regfree (&regex);
+      free (regs.start);
+      free (regs.end);
+    }
+
   /* Catch a bug reported by Vin Shelton in
      https://lists.gnu.org/r/bug-coreutils/2007-06/msg00089.html
      */
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 10/10] regex: fix comment location
  2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
                   ` (7 preceding siblings ...)
  2021-02-06  1:26 ` [PATCH 09/10] regex-tests: add bug 11053 test Paul Eggert
@ 2021-02-06  1:26 ` Paul Eggert
  8 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-06  1:26 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

* lib/regexec.c (update_regs): Move comment.
---
 ChangeLog     | 3 +++
 lib/regexec.c | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/ChangeLog b/ChangeLog
index d838ad0d0..70e357884 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: fix comment location
+	* lib/regexec.c (update_regs): Move comment.
+
 	regex-tests: add bug 11053 test
 	* tests/test-regex.c (main): New test case for glibc bug 11053.
 
diff --git a/lib/regexec.c b/lib/regexec.c
index 424bc8d15..6309deac8 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -1500,10 +1500,10 @@ update_regs (const re_dfa_t *dfa, regmatch_t *pmatch,
     }
   else if (type == OP_CLOSE_SUBEXP)
     {
+      /* We are at the last node of this sub expression.  */
       Idx reg_num = dfa->nodes[cur_node].opr.idx + 1;
       if (reg_num < nmatch)
 	{
-	  /* We are at the last node of this sub expression.  */
 	  if (pmatch[reg_num].rm_so < cur_idx)
 	    {
 	      pmatch[reg_num].rm_eo = cur_idx;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 07/10] regex: fix longstanding backref match bug
  2021-02-06  1:25 ` [PATCH 07/10] regex: fix longstanding backref match bug Paul Eggert
@ 2021-02-09 19:42   ` Adhemerval Zanella
  2021-02-09 22:22     ` Paul Eggert
  2021-04-16 21:37   ` Dmitry V. Levin
  1 sibling, 1 reply; 13+ messages in thread
From: Adhemerval Zanella @ 2021-02-09 19:42 UTC (permalink / raw)
  To: Paul Eggert, bug-gnulib

Hi Paul,

Trying to sync gnulib with glibc code, this patch trigger some regression
on glibc testcases:

FAIL: posix/tst-boost
FAIL: posix/tst-pcre
FAIL: posix/tst-rxspencer
FAIL: posix/tst-rxspencer-no-utf8

$ grep -n "FAIL rm" posix/tst-boost.out
445:FAIL rm[1] 3..-1 != expected 2..3
448:FAIL rm[1] 3..-1 != expected 2..3
451:FAIL rm[1] 3..-1 != expected 2..3
454:FAIL rm[1] 3..-1 != expected 2..3

$ cat posix/tst-pcre.out 
1168: /([a]*)*/ on a unexpectedly failed to match register 1 1..-1
1171: /([a]*)*/ on aaaaa  unexpectedly failed to match register 1 5..-1
1176: /([ab]*)*/ on a unexpectedly failed to match register 1 1..-1
1179: /([ab]*)*/ on b unexpectedly failed to match register 1 1..-1
1182: /([ab]*)*/ on ababab unexpectedly failed to match register 1 6..-1
1185: /([ab]*)*/ on aaaabcde unexpectedly failed to match register 1 5..-1
1188: /([ab]*)*/ on bbbb     unexpectedly failed to match register 1 4..-1
1193: /([^a]*)*/ on b unexpectedly failed to match register 1 1..-1
1196: /([^a]*)*/ on bbbb unexpectedly failed to match register 1 4..-1
1203: /([^ab]*)*/ on cccc unexpectedly failed to match register 1 4..-1

$ cat posix/tst-rxspencer.out  | grep FAIL
FAIL rm[1] unexpectedly did not match
FAIL rm[1] unexpectedly did not match

$ cat posix/tst-rxspencer-no-utf8.out | grep FAIL
FAIL rm[1] unexpectedly did not match
FAIL rm[1] unexpectedly did not match

On 05/02/2021 22:25, Paul Eggert wrote:
> This fixes a longstanding glibc bug concerning backreferences
> <https://sourceware.org/11053> (2009-12-04).
> * lib/regexec.c (proceed_next_node, push_fail_stack)
> (pop_fail_stack): Push and pop the previous registers
> as well as the current ones.  All callers changed.
> (set_regs): Also pop if CUR_NODE has already been checked,
> so that it does not get added as a duplicate set entry.
> (update_regs): Fix comment location.
> * tests/test-regex.c (tests): New constant.
> (bug_regex11): New test function.
> (main): Bump alarm value.  Call new test function.
> ---
>  ChangeLog          |  13 ++++++
>  lib/regexec.c      |  26 +++++++----
>  tests/test-regex.c | 113 ++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/ChangeLog b/ChangeLog
> index 74304474b..bd7d1fa16 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,5 +1,18 @@
>  2021-02-05  Paul Eggert  <eggert@cs.ucla.edu>
>  
> +	regex: fix longstanding backref match bug
> +	This fixes a longstanding glibc bug concerning backreferences
> +	<https://sourceware.org/11053> (2009-12-04).
> +	* lib/regexec.c (proceed_next_node, push_fail_stack)
> +	(pop_fail_stack): Push and pop the previous registers
> +	as well as the current ones.  All callers changed.
> +	(set_regs): Also pop if CUR_NODE has already been checked,
> +	so that it does not get added as a duplicate set entry.
> +	(update_regs): Fix comment location.
> +	* tests/test-regex.c (tests): New constant.
> +	(bug_regex11): New test function.
> +	(main): Bump alarm value.  Call new test function.
> +
>  	regex: avoid duplicate in espilon closure
>  	* lib/regcomp.c (calc_eclosure_iter): Insert NODE into epsilon
>  	closure first rather than last.  Otherwise, the epsilon closure
> diff --git a/lib/regexec.c b/lib/regexec.c
> index fdd2e373e..424bc8d15 100644
> --- a/lib/regexec.c
> +++ b/lib/regexec.c
> @@ -59,7 +59,7 @@ static void update_regs (const re_dfa_t *dfa, regmatch_t *pmatch,
>  			 Idx cur_idx, Idx nmatch);
>  static reg_errcode_t push_fail_stack (struct re_fail_stack_t *fs,
>  				      Idx str_idx, Idx dest_node, Idx nregs,
> -				      regmatch_t *regs,
> +				      regmatch_t *regs, regmatch_t *prevregs,
>  				      re_node_set *eps_via_nodes);
>  static reg_errcode_t set_regs (const regex_t *preg,
>  			       const re_match_context_t *mctx,
> @@ -1211,6 +1211,7 @@ check_halt_state_context (const re_match_context_t *mctx,
>  
>  static Idx
>  proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
> +		   regmatch_t *prevregs,
>  		   Idx *pidx, Idx node, re_node_set *eps_via_nodes,
>  		   struct re_fail_stack_t *fs)
>  {
> @@ -1243,7 +1244,7 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
>  	      /* Otherwise, push the second epsilon-transition on the fail stack.  */
>  	      else if (fs != NULL
>  		       && push_fail_stack (fs, *pidx, candidate, nregs, regs,
> -					   eps_via_nodes))
> +					   prevregs, eps_via_nodes))
>  		return -2;
>  
>  	      /* We know we are going to exit.  */
> @@ -1316,7 +1317,8 @@ proceed_next_node (const re_match_context_t *mctx, Idx nregs, regmatch_t *regs,
>  static reg_errcode_t
>  __attribute_warn_unused_result__
>  push_fail_stack (struct re_fail_stack_t *fs, Idx str_idx, Idx dest_node,
> -		 Idx nregs, regmatch_t *regs, re_node_set *eps_via_nodes)
> +		 Idx nregs, regmatch_t *regs, regmatch_t *prevregs,
> +		 re_node_set *eps_via_nodes)
>  {
>    reg_errcode_t err;
>    Idx num = fs->num++;
> @@ -1332,23 +1334,26 @@ push_fail_stack (struct re_fail_stack_t *fs, Idx str_idx, Idx dest_node,
>      }
>    fs->stack[num].idx = str_idx;
>    fs->stack[num].node = dest_node;
> -  fs->stack[num].regs = re_malloc (regmatch_t, nregs);
> +  fs->stack[num].regs = re_malloc (regmatch_t, 2 * nregs);
>    if (fs->stack[num].regs == NULL)
>      return REG_ESPACE;
>    memcpy (fs->stack[num].regs, regs, sizeof (regmatch_t) * nregs);
> +  memcpy (fs->stack[num].regs + nregs, prevregs, sizeof (regmatch_t) * nregs);
>    err = re_node_set_init_copy (&fs->stack[num].eps_via_nodes, eps_via_nodes);
>    return err;
>  }
>  
>  static Idx
>  pop_fail_stack (struct re_fail_stack_t *fs, Idx *pidx, Idx nregs,
> -		regmatch_t *regs, re_node_set *eps_via_nodes)
> +		regmatch_t *regs, regmatch_t *prevregs,
> +		re_node_set *eps_via_nodes)
>  {
>    if (fs == NULL || fs->num == 0)
>      return -1;
>    Idx num = --fs->num;
>    *pidx = fs->stack[num].idx;
>    memcpy (regs, fs->stack[num].regs, sizeof (regmatch_t) * nregs);
> +  memcpy (prevregs, fs->stack[num].regs + nregs, sizeof (regmatch_t) * nregs);
>    re_node_set_free (eps_via_nodes);
>    re_free (fs->stack[num].regs);
>    *eps_via_nodes = fs->stack[num].eps_via_nodes;
> @@ -1408,7 +1413,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>      {
>        update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
>  
> -      if (idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
> +      if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
> +	  || re_node_set_contains (&eps_via_nodes, cur_node))
>  	{
>  	  Idx reg_idx;
>  	  cur_node = -1;
> @@ -1418,7 +1424,7 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>  		if (pmatch[reg_idx].rm_so > -1 && pmatch[reg_idx].rm_eo == -1)
>  		  {
>  		    cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
> -					       &eps_via_nodes);
> +					       prev_idx_match, &eps_via_nodes);
>  		    break;
>  		  }
>  	    }
> @@ -1431,7 +1437,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>  	}
>  
>        /* Proceed to next node.  */
> -      cur_node = proceed_next_node (mctx, nmatch, pmatch, &idx, cur_node,
> +      cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
> +				    &idx, cur_node,
>  				    &eps_via_nodes, fs);
>  
>        if (__glibc_unlikely (cur_node < 0))
> @@ -1443,7 +1450,8 @@ set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
>  	      free_fail_stack_return (fs);
>  	      return REG_ESPACE;
>  	    }
> -	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch, &eps_via_nodes);
> +	  cur_node = pop_fail_stack (fs, &idx, nmatch, pmatch,
> +				     prev_idx_match, &eps_via_nodes);
>  	  if (cur_node < 0)
>  	    {
>  	      re_node_set_free (&eps_via_nodes);
> diff --git a/tests/test-regex.c b/tests/test-regex.c
> index 58f8200f2..a14619805 100644
> --- a/tests/test-regex.c
> +++ b/tests/test-regex.c
> @@ -55,6 +55,111 @@ really_utf8 (void)
>    return strcmp (locale_charset (), "UTF-8") == 0;
>  }
>  
> +/* Tests supposed to match; copied from glibc posix/bug-regex11.c.  */
> +static struct
> +{
> +  const char *pattern;
> +  const char *string;
> +  int flags, nmatch;
> +  regmatch_t rm[5];
> +} const tests[] = {
> +  /* Test for newline handling in regex.  */
> +  { "[^~]*~", "\nx~y", 0, 2, { { 0, 3 }, { -1, -1 } } },
> +  /* Other tests.  */
> +  { "a(.*)b", "a b", REG_EXTENDED, 2, { { 0, 3 }, { 1, 2 } } },
> +  { ".*|\\([KIO]\\)\\([^|]*\\).*|?[KIO]", "10~.~|P|K0|I10|O16|?KSb", 0, 3,
> +    { { 0, 21 }, { 15, 16 }, { 16, 18 } } },
> +  { ".*|\\([KIO]\\)\\([^|]*\\).*|?\\1", "10~.~|P|K0|I10|O16|?KSb", 0, 3,
> +    { { 0, 21 }, { 8, 9 }, { 9, 10 } } },
> +  { "^\\(a*\\)\\1\\{9\\}\\(a\\{0,9\\}\\)\\([0-9]*;.*[^a]\\2\\([0-9]\\)\\)",
> +    "a1;;0a1aa2aaa3aaaa4aaaaa5aaaaaa6aaaaaaa7aaaaaaaa8aaaaaaaaa9aa2aa1a0", 0,
> +    5, { { 0, 67 }, { 0, 0 }, { 0, 1 }, { 1, 67 }, { 66, 67 } } },
> +  /* Test for BRE expression anchoring.  POSIX says just that this may match;
> +     in glibc regex it always matched, so avoid changing it.  */
> +  { "\\(^\\|foo\\)bar", "bar", 0, 2, { { 0, 3 }, { -1, -1 } } },
> +  { "\\(foo\\|^\\)bar", "bar", 0, 2, { { 0, 3 }, { -1, -1 } } },
> +  /* In ERE this must be treated as an anchor.  */
> +  { "(^|foo)bar", "bar", REG_EXTENDED, 2, { { 0, 3 }, { -1, -1 } } },
> +  { "(foo|^)bar", "bar", REG_EXTENDED, 2, { { 0, 3 }, { -1, -1 } } },
> +  /* Here ^ cannot be treated as an anchor according to POSIX.  */
> +  { "(^|foo)bar", "(^|foo)bar", 0, 2, { { 0, 10 }, { -1, -1 } } },
> +  { "(foo|^)bar", "(foo|^)bar", 0, 2, { { 0, 10 }, { -1, -1 } } },
> +  /* More tests on backreferences.  */
> +  { "()\\1", "x", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } },
> +  { "()x\\1", "x", REG_EXTENDED, 2, { { 0, 1 }, { 0, 0 } } },
> +  { "()\\1*\\1*", "", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } },
> +  { "([0-9]).*\\1(a*)", "7;7a6", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 3, 4 } } },
> +  { "([0-9]).*\\1(a*)", "7;7a", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 3, 4 } } },
> +  { "(b)()c\\1", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 1 }, { 1, 1 } } },
> +  { "()(b)c\\2", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 0 }, { 0, 1 } } },
> +  { "a(b)()c\\1", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 2 }, { 2, 2 } } },
> +  { "a()(b)c\\2", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 1 }, { 1, 2 } } },
> +  { "()(b)\\1c\\2", "bcb", REG_EXTENDED, 3, { { 0, 3 }, { 0, 0 }, { 0, 1 } } },
> +  { "(b())\\2\\1", "bbbb", REG_EXTENDED, 3, { { 0, 2 }, { 0, 1 }, { 1, 1 } } },
> +  { "a()(b)\\1c\\2", "abcb", REG_EXTENDED, 3, { { 0, 4 }, { 1, 1 }, { 1, 2 } } },
> +  { "a()d(b)\\1c\\2", "adbcb", REG_EXTENDED, 3, { { 0, 5 }, { 1, 1 }, { 2, 3 } } },
> +  { "a(b())\\2\\1", "abbbb", REG_EXTENDED, 3, { { 0, 3 }, { 1, 2 }, { 2, 2 } } },
> +  { "(bb())\\2\\1", "bbbb", REG_EXTENDED, 3, { { 0, 4 }, { 0, 2 }, { 2, 2 } } },
> +  { "^([^,]*),\\1,\\1$", "a,a,a", REG_EXTENDED, 2, { { 0, 5 }, { 0, 1 } } },
> +  { "^([^,]*),\\1,\\1$", "ab,ab,ab", REG_EXTENDED, 2, { { 0, 8 }, { 0, 2 } } },
> +  { "^([^,]*),\\1,\\1,\\1$", "abc,abc,abc,abc", REG_EXTENDED, 2,
> +    { { 0, 15 }, { 0, 3 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?).?\\5\\4\\3\\2\\1$",
> +    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
> +    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
> +    "abcdedcba", REG_EXTENDED, 1, { { 0, 9 } } },
> +  /* XXX Not used since they fail so far.  */
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$|^.?$",
> +    "ababababa", REG_EXTENDED, 1, { { 0, 9 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$",
> +    "level", REG_NOSUB | REG_EXTENDED, 0, { { -1, -1 } } },
> +  { "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\\9\\8\\7\\6\\5\\4\\3\\2\\1$",
> +    "ababababa", REG_EXTENDED, 1, { { 0, 9 } } },
> +};
> +
> +static void
> +bug_regex11 (void)
> +{
> +  regex_t re;
> +  regmatch_t rm[5];
> +  size_t i;
> +  int n;
> +
> +  for (i = 0; i < sizeof (tests) / sizeof (tests[0]); ++i)
> +    {
> +      n = regcomp (&re, tests[i].pattern, tests[i].flags);
> +      if (n != 0)
> +	{
> +	  char buf[500];
> +	  regerror (n, &re, buf, sizeof (buf));
> +	  report_error ("%s: regcomp %zd failed: %s", tests[i].pattern, i, buf);
> +	  continue;
> +	}
> +
> +      if (regexec (&re, tests[i].string, tests[i].nmatch, rm, 0))
> +	{
> +	  report_error ("%s: regexec %zd failed", tests[i].pattern, i);
> +	  regfree (&re);
> +	  continue;
> +	}
> +
> +      for (n = 0; n < tests[i].nmatch; ++n)
> +	if (rm[n].rm_so != tests[i].rm[n].rm_so
> +              || rm[n].rm_eo != tests[i].rm[n].rm_eo)
> +	  {
> +	    if (tests[i].rm[n].rm_so == -1 && tests[i].rm[n].rm_eo == -1)
> +	      break;
> +	    report_error ("%s: regexec %zd match failure rm[%d] %d..%d",
> +                          tests[i].pattern, i, n, rm[n].rm_so, rm[n].rm_eo);
> +	    break;
> +	  }
> +
> +      regfree (&re);
> +    }
> +}
> +
>  int
>  main (void)
>  {
> @@ -65,11 +170,15 @@ main (void)
>    struct re_registers regs;
>  
>  #if HAVE_DECL_ALARM
> -  /* Some builds of glibc go into an infinite loop on this test.  */
> -  int alarm_value = 2;
> +  /* In case a bug causes glibc to go into an infinite loop.
> +     The tests should take less than 10 s on a reasonably modern CPU.  */
> +  int alarm_value = 1000;
>    signal (SIGALRM, SIG_DFL);
>    alarm (alarm_value);
>  #endif
> +
> +  bug_regex11 ();
> +
>    if (setlocale (LC_ALL, "en_US.UTF-8"))
>      {
>        {
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 07/10] regex: fix longstanding backref match bug
  2021-02-09 19:42   ` Adhemerval Zanella
@ 2021-02-09 22:22     ` Paul Eggert
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2021-02-09 22:22 UTC (permalink / raw)
  To: Adhemerval Zanella, bug-gnulib

On 2/9/21 11:42 AM, Adhemerval Zanella wrote:
> Trying to sync gnulib with glibc code, this patch trigger some regression
> on glibc testcases:

Thanks for letting me know. That's odd, as I didn't find any regressions 
when I synced. I do know of remaining bugs and will add these to my list.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 07/10] regex: fix longstanding backref match bug
  2021-02-06  1:25 ` [PATCH 07/10] regex: fix longstanding backref match bug Paul Eggert
  2021-02-09 19:42   ` Adhemerval Zanella
@ 2021-04-16 21:37   ` Dmitry V. Levin
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitry V. Levin @ 2021-04-16 21:37 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bug-gnulib

On Fri, Feb 05, 2021 at 05:25:59PM -0800, Paul Eggert wrote:
> This fixes a longstanding glibc bug concerning backreferences
> <https://sourceware.org/11053> (2009-12-04).

Nit: this short URL doesn't work, while the longer one does:
https://sourceware.org/bugzilla/show_bug.cgi?id=11053

> * lib/regexec.c (proceed_next_node, push_fail_stack)
> (pop_fail_stack): Push and pop the previous registers
> as well as the current ones.  All callers changed.
> (set_regs): Also pop if CUR_NODE has already been checked,
> so that it does not get added as a duplicate set entry.
> (update_regs): Fix comment location.
> * tests/test-regex.c (tests): New constant.
> (bug_regex11): New test function.
> (main): Bump alarm value.  Call new test function.
> ---
>  ChangeLog          |  13 ++++++
>  lib/regexec.c      |  26 +++++++----
>  tests/test-regex.c | 113 ++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 141 insertions(+), 11 deletions(-)

Apparently, this commit changes behaviour of GNU sed in a way that looks
like regression.

When GNU sed is built with this gnulib commit (v0.1-4426-g70b673eb7):
$ echo 123 |sed -r 's/^1*+(.)/\1/'
3
$ echo 123 |sed -r 's/^(1*)*(.)/\2/'
3

When GNU sed is built with the previous gnulib commit (v0.1-4425-gf5596242f):
$ echo 123 |sed -r 's/^1*+(.)/\1/'
23
$ echo 123 |sed -r 's/^(1*)*(.)/\2/'
23

There are regexps in the wild containing "*+" that used to work before,
see e.g.
https://lore.kernel.org/lkml/20210414182723.1670663-1-vt@altlinux.org/T/#u


-- 
ldv


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-04-16 22:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-06  1:25 [PATCH 01/10] regex: improve comments Paul Eggert
2021-02-06  1:25 ` [PATCH 02/10] regex: avoid undefined behavior Paul Eggert
2021-02-06  1:25 ` [PATCH 03/10] regex: minor refactoring Paul Eggert
2021-02-06  1:25 ` [PATCH 04/10] regex: make it easier to merge into glibc Paul Eggert
2021-02-06  1:25 ` [PATCH 05/10] regex-tests: fix typo Paul Eggert
2021-02-06  1:25 ` [PATCH 06/10] regex: avoid duplicate in espilon closure Paul Eggert
2021-02-06  1:25 ` [PATCH 07/10] regex: fix longstanding backref match bug Paul Eggert
2021-02-09 19:42   ` Adhemerval Zanella
2021-02-09 22:22     ` Paul Eggert
2021-04-16 21:37   ` Dmitry V. Levin
2021-02-06  1:26 ` [PATCH 08/10] regex: debug check for set member duplicates Paul Eggert
2021-02-06  1:26 ` [PATCH 09/10] regex-tests: add bug 11053 test Paul Eggert
2021-02-06  1:26 ` [PATCH 10/10] regex: fix comment location Paul Eggert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).