bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed
* [PATCH] grep: a kwset matcher not work in a grep matcher
@ 2019-03-22 23:06 Norihiro Tanaka
  2019-03-23  2:49 ` bug#34951: " Norihiro Tanaka
  0 siblings, 1 reply; 26+ messages in thread
From: Norihiro Tanaka @ 2019-03-22 23:06 UTC (permalink / raw)
  To: bug-grep; +Cc: bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 2670 bytes --]

A kwset matcher is not built in a grep matcher after token re-order is
introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98 in dfa.
It caused performance degradation in some typical cases.  This bug is
introduced in grep-3.2.

DFAMUST() does not work if tokens which are parsed in dfa matcher are
re-ordered.  Therefore, change as it is called after parse and before
tokens re-order.

BTW, this change does not affect programs that do not use DFAMUST(),
such as sed or gawk.

$ yes $(printf '%040d' 0) | head -10000000 >inp
$ grep-2.2/src/grep 01.2 inp
real 1.61
user 1.53
sys 0.07
$ grep-2.3/src/grep 01.2 inp
real 1.57
user 1.48
sys 0.08
$ grep-2.4/src/grep 01.2 inp
real 1.50
user 1.44
sys 0.05
$ grep-2.4.1/src/grep 01.2 inp
real 1.53
user 1.48
sys 0.05
$ grep-2.4.2/src/grep 01.2 inp
real 1.52
user 1.47
sys 0.04
$ grep-2.5.4/src/grep 01.2 inp
real 1.53
user 1.47
sys 0.05
$ grep-2.6/src/grep 01.2 inp
real 1.51
user 1.47
sys 0.04
$ grep-2.6.1/src/grep 01.2 inp
real 1.50
user 1.44
sys 0.05
$ grep-2.6.2/src/grep 01.2 inp
real 1.52
user 1.46
sys 0.05
$ grep-2.6.3/src/grep 01.2 inp
real 1.52
user 1.47
sys 0.05
$ grep-2.7/src/grep 01.2 inp
real 1.53
user 1.49
sys 0.04
$ grep-2.8/src/grep 01.2 inp
real 1.52
user 1.46
sys 0.05
$ grep-2.9/src/grep 01.2 inp
real 1.54
user 1.50
sys 0.04
$ grep-2.10/src/grep 01.2 inp
real 1.51
user 1.46
sys 0.05
$ grep-2.11/src/grep 01.2 inp
real 1.53
user 1.48
sys 0.05
$ grep-2.12/src/grep 01.2 inp
real 1.51
user 1.47
sys 0.03
$ grep-2.13/src/grep 01.2 inp
real 1.52
user 1.47
sys 0.03
$ grep-2.14/src/grep 01.2 inp
real 1.52
user 1.47
sys 0.04
$ grep-2.15/src/grep 01.2 inp
real 1.55
user 1.49
sys 0.05
$ grep-2.16/src/grep 01.2 inp
real 1.53
user 1.48
sys 0.04
$ grep-2.17/src/grep 01.2 inp
real 1.53
user 1.48
sys 0.05
$ grep-2.18/src/grep 01.2 inp
real 1.51
user 1.44
sys 0.06
$ grep-2.19/src/grep 01.2 inp
real 0.06
user 0.02
sys 0.04
$ grep-2.20/src/grep 01.2 inp
real 0.07
user 0.01
sys 0.05
$ grep-2.21/src/grep 01.2 inp
real 0.06
user 0.02
sys 0.04
$ grep-2.22/src/grep 01.2 inp
real 0.06
user 0.01
sys 0.05
$ grep-2.23/src/grep 01.2 inp
real 0.09
user 0.04
sys 0.05
$ grep-2.24/src/grep 01.2 inp
real 0.09
user 0.04
sys 0.04
$ grep-2.25/src/grep 01.2 inp
real 0.09
user 0.05
sys 0.04
$ grep-2.26/src/grep 01.2 inp
real 0.09
user 0.04
sys 0.05
$ grep-2.27/src/grep 01.2 inp
real 0.09
user 0.04
sys 0.04
$ grep-2.28/src/grep 01.2 inp
real 0.09
user 0.04
sys 0.04
$ grep-3.0/src/grep 01.2 inp
real 0.09
user 0.04
sys 0.04
$ grep-3.1/src/grep 01.2 inp
real 0.11
user 0.05
sys 0.06
$ grep-3.2/src/grep 01.2 inp
real 0.37
user 0.32
sys 0.04
$ grep-3.3/src/grep 01.2 inp
real 0.29
user 0.25
sys 0.04

Thanks,
Norihiro

[-- Attachment #2: 0001-dfa-separate-parse-and-compile-phase.patch --]
[-- Type: text/plain, Size: 1015 bytes --]

From fca6a4c3b9e0757637b7a2009ca8b9070a6874f5 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <noritnk@kcn.ne.jp>
Date: Sat, 23 Mar 2019 07:18:37 +0900
Subject: [PATCH] dfa: separate parse and compile phase

DFAMUST() must be called after parse and before tokens re-order which is
introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98, but both are
executed in compilation phase.

* lib/dfa.c (dfaparse): Change it to global function.
(dfacomp): If first argument is NULL, skip parse.
* lib/dfa.h: (dfaparse): Add a prototype.
---
 src/dfasearch.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/src/dfasearch.c b/src/dfasearch.c
index 3ebd25e..f3f889f 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -202,8 +202,9 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits)
   else
     motif = NULL;
 
-  dfacomp (pattern, size, dc->dfa, 1);
+  dfaparse (pattern, size, dc->dfa);
   kwsmusts (dc);
+  dfacomp (NULL, 0, dc->dfa, 1);
 
   free (motif);
 
-- 
1.7.1


[-- Attachment #3: 0001-grep-a-kwset-matcher-not-work-in-a-grep-matcher.patch --]
[-- Type: text/plain, Size: 975 bytes --]

From 60d47e1d9a5af18eeb61719c7ac8c8ae097a06e4 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <noritnk@kcn.ne.jp>
Date: Sat, 23 Mar 2019 07:39:04 +0900
Subject: [PATCH] grep: a kwset matcher not work in a grep matcher

A kwset matcher is not built in a grep matcher after tokens re-order is
introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98 in dfa.
Now DFAMUST() must be called after parse and before compile.

* src/dfasearch.c (GEAcompile): Build a kwset match before compile dfa.
---
 src/dfasearch.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/src/dfasearch.c b/src/dfasearch.c
index 3ebd25e..f3f889f 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -202,8 +202,9 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits)
   else
     motif = NULL;
 
-  dfacomp (pattern, size, dc->dfa, 1);
+  dfaparse (pattern, size, dc->dfa);
   kwsmusts (dc);
+  dfacomp (NULL, 0, dc->dfa, 1);
 
   free (motif);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-03-22 23:06 [PATCH] grep: a kwset matcher not work in a grep matcher Norihiro Tanaka
@ 2019-03-23  2:49 ` Norihiro Tanaka
  2019-03-23  2:58   ` Budi
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Norihiro Tanaka @ 2019-03-23  2:49 UTC (permalink / raw)
  To: 34951; +Cc: bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 3124 bytes --]

On Sat, 23 Mar 2019 08:06:35 +0900
Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:

> A kwset matcher is not built in a grep matcher after token re-order is
> introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98 in dfa.
> It caused performance degradation in some typical cases.  This bug is
> introduced in grep-3.2.
> 
> DFAMUST() does not work if tokens which are parsed in dfa matcher are
> re-ordered.  Therefore, change as it is called after parse and before
> tokens re-order.
> 
> BTW, this change does not affect programs that do not use DFAMUST(),
> such as sed or gawk.
> 
> $ yes $(printf '%040d' 0) | head -10000000 >inp
> $ grep-2.2/src/grep 01.2 inp
> real 1.61
> user 1.53
> sys 0.07
> $ grep-2.3/src/grep 01.2 inp
> real 1.57
> user 1.48
> sys 0.08
> $ grep-2.4/src/grep 01.2 inp
> real 1.50
> user 1.44
> sys 0.05
> $ grep-2.4.1/src/grep 01.2 inp
> real 1.53
> user 1.48
> sys 0.05
> $ grep-2.4.2/src/grep 01.2 inp
> real 1.52
> user 1.47
> sys 0.04
> $ grep-2.5.4/src/grep 01.2 inp
> real 1.53
> user 1.47
> sys 0.05
> $ grep-2.6/src/grep 01.2 inp
> real 1.51
> user 1.47
> sys 0.04
> $ grep-2.6.1/src/grep 01.2 inp
> real 1.50
> user 1.44
> sys 0.05
> $ grep-2.6.2/src/grep 01.2 inp
> real 1.52
> user 1.46
> sys 0.05
> $ grep-2.6.3/src/grep 01.2 inp
> real 1.52
> user 1.47
> sys 0.05
> $ grep-2.7/src/grep 01.2 inp
> real 1.53
> user 1.49
> sys 0.04
> $ grep-2.8/src/grep 01.2 inp
> real 1.52
> user 1.46
> sys 0.05
> $ grep-2.9/src/grep 01.2 inp
> real 1.54
> user 1.50
> sys 0.04
> $ grep-2.10/src/grep 01.2 inp
> real 1.51
> user 1.46
> sys 0.05
> $ grep-2.11/src/grep 01.2 inp
> real 1.53
> user 1.48
> sys 0.05
> $ grep-2.12/src/grep 01.2 inp
> real 1.51
> user 1.47
> sys 0.03
> $ grep-2.13/src/grep 01.2 inp
> real 1.52
> user 1.47
> sys 0.03
> $ grep-2.14/src/grep 01.2 inp
> real 1.52
> user 1.47
> sys 0.04
> $ grep-2.15/src/grep 01.2 inp
> real 1.55
> user 1.49
> sys 0.05
> $ grep-2.16/src/grep 01.2 inp
> real 1.53
> user 1.48
> sys 0.04
> $ grep-2.17/src/grep 01.2 inp
> real 1.53
> user 1.48
> sys 0.05
> $ grep-2.18/src/grep 01.2 inp
> real 1.51
> user 1.44
> sys 0.06
> $ grep-2.19/src/grep 01.2 inp
> real 0.06
> user 0.02
> sys 0.04
> $ grep-2.20/src/grep 01.2 inp
> real 0.07
> user 0.01
> sys 0.05
> $ grep-2.21/src/grep 01.2 inp
> real 0.06
> user 0.02
> sys 0.04
> $ grep-2.22/src/grep 01.2 inp
> real 0.06
> user 0.01
> sys 0.05
> $ grep-2.23/src/grep 01.2 inp
> real 0.09
> user 0.04
> sys 0.05
> $ grep-2.24/src/grep 01.2 inp
> real 0.09
> user 0.04
> sys 0.04
> $ grep-2.25/src/grep 01.2 inp
> real 0.09
> user 0.05
> sys 0.04
> $ grep-2.26/src/grep 01.2 inp
> real 0.09
> user 0.04
> sys 0.05
> $ grep-2.27/src/grep 01.2 inp
> real 0.09
> user 0.04
> sys 0.04
> $ grep-2.28/src/grep 01.2 inp
> real 0.09
> user 0.04
> sys 0.04
> $ grep-3.0/src/grep 01.2 inp
> real 0.09
> user 0.04
> sys 0.04
> $ grep-3.1/src/grep 01.2 inp
> real 0.11
> user 0.05
> sys 0.06
> $ grep-3.2/src/grep 01.2 inp
> real 0.37
> user 0.32
> sys 0.04
> $ grep-3.3/src/grep 01.2 inp
> real 0.29
> user 0.25
> sys 0.04
> 
> Thanks,
> Norihiro

Missing a patch for dfa.  Re-send correct patch file.

[-- Attachment #2: 0001-dfa-separate-parse-and-compile-phase.patch --]
[-- Type: text/plain, Size: 1916 bytes --]

From d12d3256043a792fe65830ff6469bba6418876e1 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <noritnk@kcn.ne.jp>
Date: Sat, 23 Mar 2019 08:19:11 +0900
Subject: [PATCH] dfa: separate parse and compile phase

DFAMUST() must be called after parse and before tokens re-order which is
introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98, but both are
executed in compilation phase.

* lib/dfa.c (dfaparse): Change it to global function.
(dfacomp): If first argument is NULL, skip parse.
* lib/dfa.h: (dfaparse): Add a prototype.
---
 lib/dfa.c |    6 ++++--
 lib/dfa.h |    3 +++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/dfa.c b/lib/dfa.c
index 329a209..1e125b4 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -1969,7 +1969,7 @@ regexp (struct dfa *dfa)
 /* Main entry point for the parser.  S is a string to be parsed, len is the
    length of the string, so s can include NUL characters.  D is a pointer to
    the struct dfa to parse into.  */
-static void
+void
 dfaparse (char const *s, size_t len, struct dfa *d)
 {
   d->lex.ptr = s;
@@ -3745,7 +3745,9 @@ dfassbuild (struct dfa *d)
 void
 dfacomp (char const *s, size_t len, struct dfa *d, bool searchflag)
 {
-  dfaparse (s, len, d);
+  if (s != NULL)
+    dfaparse (s, len, d);
+
   dfassbuild (d);
 
   if (dfa_supported (d))
diff --git a/lib/dfa.h b/lib/dfa.h
index 60512e2..221f7d1 100644
--- a/lib/dfa.h
+++ b/lib/dfa.h
@@ -71,6 +71,9 @@ extern struct dfamust *dfamust (struct dfa const *);
 /* Free the storage held by the components of a struct dfamust. */
 extern void dfamustfree (struct dfamust *);
 
+/* Parse the given string of given length into the given struct dfa.  */
+extern void dfaparse (char const *, size_t, struct dfa *);
+
 /* Compile the given string of the given length into the given struct dfa.
    Final argument is a flag specifying whether to build a searching or an
    exact matcher. */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-03-23  2:49 ` bug#34951: " Norihiro Tanaka
@ 2019-03-23  2:58   ` Budi
  2019-03-23  2:59     ` Budi
  2019-03-29 10:58   ` arnold
  2019-12-11 23:25   ` Paul Eggert
  2 siblings, 1 reply; 26+ messages in thread
From: Budi @ 2019-03-23  2:58 UTC (permalink / raw)
  To: Norihiro Tanaka; +Cc: 34951, bug-gnulib

How make grep walinh through FS by scanning breadth first instead of
the usual depth

On 3/23/19, Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> On Sat, 23 Mar 2019 08:06:35 +0900
> Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
>
>> A kwset matcher is not built in a grep matcher after token re-order is
>> introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98 in dfa.
>> It caused performance degradation in some typical cases.  This bug is
>> introduced in grep-3.2.
>>
>> DFAMUST() does not work if tokens which are parsed in dfa matcher are
>> re-ordered.  Therefore, change as it is called after parse and before
>> tokens re-order.
>>
>> BTW, this change does not affect programs that do not use DFAMUST(),
>> such as sed or gawk.
>>
>> $ yes $(printf '%040d' 0) | head -10000000 >inp
>> $ grep-2.2/src/grep 01.2 inp
>> real 1.61
>> user 1.53
>> sys 0.07
>> $ grep-2.3/src/grep 01.2 inp
>> real 1.57
>> user 1.48
>> sys 0.08
>> $ grep-2.4/src/grep 01.2 inp
>> real 1.50
>> user 1.44
>> sys 0.05
>> $ grep-2.4.1/src/grep 01.2 inp
>> real 1.53
>> user 1.48
>> sys 0.05
>> $ grep-2.4.2/src/grep 01.2 inp
>> real 1.52
>> user 1.47
>> sys 0.04
>> $ grep-2.5.4/src/grep 01.2 inp
>> real 1.53
>> user 1.47
>> sys 0.05
>> $ grep-2.6/src/grep 01.2 inp
>> real 1.51
>> user 1.47
>> sys 0.04
>> $ grep-2.6.1/src/grep 01.2 inp
>> real 1.50
>> user 1.44
>> sys 0.05
>> $ grep-2.6.2/src/grep 01.2 inp
>> real 1.52
>> user 1.46
>> sys 0.05
>> $ grep-2.6.3/src/grep 01.2 inp
>> real 1.52
>> user 1.47
>> sys 0.05
>> $ grep-2.7/src/grep 01.2 inp
>> real 1.53
>> user 1.49
>> sys 0.04
>> $ grep-2.8/src/grep 01.2 inp
>> real 1.52
>> user 1.46
>> sys 0.05
>> $ grep-2.9/src/grep 01.2 inp
>> real 1.54
>> user 1.50
>> sys 0.04
>> $ grep-2.10/src/grep 01.2 inp
>> real 1.51
>> user 1.46
>> sys 0.05
>> $ grep-2.11/src/grep 01.2 inp
>> real 1.53
>> user 1.48
>> sys 0.05
>> $ grep-2.12/src/grep 01.2 inp
>> real 1.51
>> user 1.47
>> sys 0.03
>> $ grep-2.13/src/grep 01.2 inp
>> real 1.52
>> user 1.47
>> sys 0.03
>> $ grep-2.14/src/grep 01.2 inp
>> real 1.52
>> user 1.47
>> sys 0.04
>> $ grep-2.15/src/grep 01.2 inp
>> real 1.55
>> user 1.49
>> sys 0.05
>> $ grep-2.16/src/grep 01.2 inp
>> real 1.53
>> user 1.48
>> sys 0.04
>> $ grep-2.17/src/grep 01.2 inp
>> real 1.53
>> user 1.48
>> sys 0.05
>> $ grep-2.18/src/grep 01.2 inp
>> real 1.51
>> user 1.44
>> sys 0.06
>> $ grep-2.19/src/grep 01.2 inp
>> real 0.06
>> user 0.02
>> sys 0.04
>> $ grep-2.20/src/grep 01.2 inp
>> real 0.07
>> user 0.01
>> sys 0.05
>> $ grep-2.21/src/grep 01.2 inp
>> real 0.06
>> user 0.02
>> sys 0.04
>> $ grep-2.22/src/grep 01.2 inp
>> real 0.06
>> user 0.01
>> sys 0.05
>> $ grep-2.23/src/grep 01.2 inp
>> real 0.09
>> user 0.04
>> sys 0.05
>> $ grep-2.24/src/grep 01.2 inp
>> real 0.09
>> user 0.04
>> sys 0.04
>> $ grep-2.25/src/grep 01.2 inp
>> real 0.09
>> user 0.05
>> sys 0.04
>> $ grep-2.26/src/grep 01.2 inp
>> real 0.09
>> user 0.04
>> sys 0.05
>> $ grep-2.27/src/grep 01.2 inp
>> real 0.09
>> user 0.04
>> sys 0.04
>> $ grep-2.28/src/grep 01.2 inp
>> real 0.09
>> user 0.04
>> sys 0.04
>> $ grep-3.0/src/grep 01.2 inp
>> real 0.09
>> user 0.04
>> sys 0.04
>> $ grep-3.1/src/grep 01.2 inp
>> real 0.11
>> user 0.05
>> sys 0.06
>> $ grep-3.2/src/grep 01.2 inp
>> real 0.37
>> user 0.32
>> sys 0.04
>> $ grep-3.3/src/grep 01.2 inp
>> real 0.29
>> user 0.25
>> sys 0.04
>>
>> Thanks,
>> Norihiro
>
> Missing a patch for dfa.  Re-send correct patch file.
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-03-23  2:58   ` Budi
@ 2019-03-23  2:59     ` Budi
  2019-03-23 12:39       ` Eric Blake
  0 siblings, 1 reply; 26+ messages in thread
From: Budi @ 2019-03-23  2:59 UTC (permalink / raw)
  To: Norihiro Tanaka; +Cc: 34951, bug-gnulib

How make grep walking through FS by scanning breadth first instead of

On 3/23/19, Budi <budikusasi@gmail.com> wrote:
> How make grep walinh through FS by scanning breadth first instead of
> the usual depth
>
> On 3/23/19, Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
>> On Sat, 23 Mar 2019 08:06:35 +0900
>> Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
>>
>>> A kwset matcher is not built in a grep matcher after token re-order is
>>> introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98 in dfa.
>>> It caused performance degradation in some typical cases.  This bug is
>>> introduced in grep-3.2.
>>>
>>> DFAMUST() does not work if tokens which are parsed in dfa matcher are
>>> re-ordered.  Therefore, change as it is called after parse and before
>>> tokens re-order.
>>>
>>> BTW, this change does not affect programs that do not use DFAMUST(),
>>> such as sed or gawk.
>>>
>>> $ yes $(printf '%040d' 0) | head -10000000 >inp
>>> $ grep-2.2/src/grep 01.2 inp
>>> real 1.61
>>> user 1.53
>>> sys 0.07
>>> $ grep-2.3/src/grep 01.2 inp
>>> real 1.57
>>> user 1.48
>>> sys 0.08
>>> $ grep-2.4/src/grep 01.2 inp
>>> real 1.50
>>> user 1.44
>>> sys 0.05
>>> $ grep-2.4.1/src/grep 01.2 inp
>>> real 1.53
>>> user 1.48
>>> sys 0.05
>>> $ grep-2.4.2/src/grep 01.2 inp
>>> real 1.52
>>> user 1.47
>>> sys 0.04
>>> $ grep-2.5.4/src/grep 01.2 inp
>>> real 1.53
>>> user 1.47
>>> sys 0.05
>>> $ grep-2.6/src/grep 01.2 inp
>>> real 1.51
>>> user 1.47
>>> sys 0.04
>>> $ grep-2.6.1/src/grep 01.2 inp
>>> real 1.50
>>> user 1.44
>>> sys 0.05
>>> $ grep-2.6.2/src/grep 01.2 inp
>>> real 1.52
>>> user 1.46
>>> sys 0.05
>>> $ grep-2.6.3/src/grep 01.2 inp
>>> real 1.52
>>> user 1.47
>>> sys 0.05
>>> $ grep-2.7/src/grep 01.2 inp
>>> real 1.53
>>> user 1.49
>>> sys 0.04
>>> $ grep-2.8/src/grep 01.2 inp
>>> real 1.52
>>> user 1.46
>>> sys 0.05
>>> $ grep-2.9/src/grep 01.2 inp
>>> real 1.54
>>> user 1.50
>>> sys 0.04
>>> $ grep-2.10/src/grep 01.2 inp
>>> real 1.51
>>> user 1.46
>>> sys 0.05
>>> $ grep-2.11/src/grep 01.2 inp
>>> real 1.53
>>> user 1.48
>>> sys 0.05
>>> $ grep-2.12/src/grep 01.2 inp
>>> real 1.51
>>> user 1.47
>>> sys 0.03
>>> $ grep-2.13/src/grep 01.2 inp
>>> real 1.52
>>> user 1.47
>>> sys 0.03
>>> $ grep-2.14/src/grep 01.2 inp
>>> real 1.52
>>> user 1.47
>>> sys 0.04
>>> $ grep-2.15/src/grep 01.2 inp
>>> real 1.55
>>> user 1.49
>>> sys 0.05
>>> $ grep-2.16/src/grep 01.2 inp
>>> real 1.53
>>> user 1.48
>>> sys 0.04
>>> $ grep-2.17/src/grep 01.2 inp
>>> real 1.53
>>> user 1.48
>>> sys 0.05
>>> $ grep-2.18/src/grep 01.2 inp
>>> real 1.51
>>> user 1.44
>>> sys 0.06
>>> $ grep-2.19/src/grep 01.2 inp
>>> real 0.06
>>> user 0.02
>>> sys 0.04
>>> $ grep-2.20/src/grep 01.2 inp
>>> real 0.07
>>> user 0.01
>>> sys 0.05
>>> $ grep-2.21/src/grep 01.2 inp
>>> real 0.06
>>> user 0.02
>>> sys 0.04
>>> $ grep-2.22/src/grep 01.2 inp
>>> real 0.06
>>> user 0.01
>>> sys 0.05
>>> $ grep-2.23/src/grep 01.2 inp
>>> real 0.09
>>> user 0.04
>>> sys 0.05
>>> $ grep-2.24/src/grep 01.2 inp
>>> real 0.09
>>> user 0.04
>>> sys 0.04
>>> $ grep-2.25/src/grep 01.2 inp
>>> real 0.09
>>> user 0.05
>>> sys 0.04
>>> $ grep-2.26/src/grep 01.2 inp
>>> real 0.09
>>> user 0.04
>>> sys 0.05
>>> $ grep-2.27/src/grep 01.2 inp
>>> real 0.09
>>> user 0.04
>>> sys 0.04
>>> $ grep-2.28/src/grep 01.2 inp
>>> real 0.09
>>> user 0.04
>>> sys 0.04
>>> $ grep-3.0/src/grep 01.2 inp
>>> real 0.09
>>> user 0.04
>>> sys 0.04
>>> $ grep-3.1/src/grep 01.2 inp
>>> real 0.11
>>> user 0.05
>>> sys 0.06
>>> $ grep-3.2/src/grep 01.2 inp
>>> real 0.37
>>> user 0.32
>>> sys 0.04
>>> $ grep-3.3/src/grep 01.2 inp
>>> real 0.29
>>> user 0.25
>>> sys 0.04
>>>
>>> Thanks,
>>> Norihiro
>>
>> Missing a patch for dfa.  Re-send correct patch file.
>>
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-03-23  2:59     ` Budi
@ 2019-03-23 12:39       ` Eric Blake
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Blake @ 2019-03-23 12:39 UTC (permalink / raw)
  To: Budi, Norihiro Tanaka; +Cc: 34951, bug-gnulib


[-- Attachment #1.1: Type: text/plain, Size: 1473 bytes --]

On 3/22/19 9:59 PM, Budi wrote:
> How make grep walking through FS by scanning breadth first instead of
> 
> On 3/23/19, Budi <budikusasi@gmail.com> wrote:
>> How make grep walinh through FS by scanning breadth first instead of
>> the usual depth
>>
>> On 3/23/19, Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
>>> On Sat, 23 Mar 2019 08:06:35 +0900
>>> Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
>>>
>>>> A kwset matcher is not built in a grep matcher after token re-order is

Budi,

Hijacking a tread on a posted patch to ask an unrelated question via
top-posting is not very nice netiquette.  Better is to start a new
thread for asking questions, and to use bottom posting for technical lists.

That said, the answer to your question is that there is no way to change
the way that grep walks the file system when using 'grep -r'. And when
you consider that 'grep -r' is a GNU extension not required by POSIX
(http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html),
and that we are reluctant to bloat grep any further when 'find' already
exists as the POSIX-sanctioned file walker, you are better off getting
'find' to do the traversal you want (where find or xargs is used to
invoke plain 'grep' on the resulting files) rather than trying to
convince us to patch 'grep -r' to have more flexibility.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-03-23  2:49 ` bug#34951: " Norihiro Tanaka
  2019-03-23  2:58   ` Budi
@ 2019-03-29 10:58   ` arnold
  2019-12-11 23:25   ` Paul Eggert
  2 siblings, 0 replies; 26+ messages in thread
From: arnold @ 2019-03-29 10:58 UTC (permalink / raw)
  To: noritnk, 34951; +Cc: bug-gnulib

Hi.

Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:

> Missing a patch for dfa.  Re-send correct patch file.

Paul - is this going to be merged into GNULIB? If so, I'll put it into
gawk now; I want to make a release soon.

Thanks,

Arnold
[


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-03-23  2:49 ` bug#34951: " Norihiro Tanaka
  2019-03-23  2:58   ` Budi
  2019-03-29 10:58   ` arnold
@ 2019-12-11 23:25   ` Paul Eggert
  2019-12-12  7:23     ` arnold
  2019-12-12  7:31     ` arnold
  2 siblings, 2 replies; 26+ messages in thread
From: Paul Eggert @ 2019-12-11 23:25 UTC (permalink / raw)
  To: Norihiro Tanaka, 34951; +Cc: Aharon Robbins, bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 1037 bytes --]

On 3/22/19 7:49 PM, Norihiro Tanaka wrote:
> Missing a patch for dfa.  Re-send correct patch file.

Thanks, I installed the DFA-relevant parts of your proposed fix into 
Gnulib. (The grep parts still need doing.) I also installed the attached 
commentary followup.

While I was at it I installed a patch to fix an unlikely integer 
overflow that I noticed while reviewing your fix. I also installed some 
internal changes to prefer signed to unsigned integers for indexes, as 
this should make future integer overflows easier to catch. See:

https://lists.gnu.org/r/bug-gnulib/2019-12/msg00058.html
https://lists.gnu.org/r/bug-gnulib/2019-12/msg00059.html

I'd also like to change dfa.h's API to prefer ptrdiff_t to size_t, for 
the same integer-overflow reason. This would be a (minor) API change so 
I thought I'd ask first. Any objections?

PS. Arnold, the above discusses all the changes I know about for dfa.c 
and dfa.h. The proposed API change (size_t->ptrdiff_t) could be 
installed either before or after the next Gawk release.

[-- Attachment #2: 0001-dfa-update-commentary-for-previous-change.patch --]
[-- Type: text/x-patch, Size: 3855 bytes --]

From 360cbd3b17a314807e808626e100ef47dcf4d162 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Wed, 11 Dec 2019 13:40:01 -0800
Subject: [PATCH] dfa: update commentary for previous change

* NEWS: Mention the change.
* lib/dfa.c, lib/dfa.h (dfaparse, dfamust, dfacomp): Update comments.
---
 ChangeLog |  6 ++++++
 NEWS      |  4 ++++
 lib/dfa.c |  9 +++++----
 lib/dfa.h | 14 ++++++++------
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index f80f33b38..bc912c771 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2019-12-11  Paul Eggert  <eggert@cs.ucla.edu>
+
+	dfa: update commentary for previous change
+	* NEWS: Mention the change.
+	* lib/dfa.c, lib/dfa.h (dfaparse, dfamust, dfacomp): Update comments.
+
 2019-12-11  Norihiro Tanaka  <noritnk@kcn.ne.jp>
 
 	dfa: separate parse and compile phase
diff --git a/NEWS b/NEWS
index 8085c353e..b73c9088a 100644
--- a/NEWS
+++ b/NEWS
@@ -58,6 +58,10 @@ User visible incompatible changes
 
 Date        Modules         Changes
 
+2019-12-11  dfa             To call dfamust, one must now call dfaparse
+                            without yet calling dfacomp.  This fixes a bug
+                            introduced on 2018-10-22 that broke dfamust.
+
 2019-12-07  xstrtol         This module no longer defines the function
             xstrtoll        'xstrtol_fatal'.  Program that need this function
             xstrtoimax      should add the module 'xstrtol-error' to the list
diff --git a/lib/dfa.c b/lib/dfa.c
index 1e125b4d2..2347a91c1 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -1966,9 +1966,8 @@ regexp (struct dfa *dfa)
     }
 }
 
-/* Main entry point for the parser.  S is a string to be parsed, len is the
-   length of the string, so s can include NUL characters.  D is a pointer to
-   the struct dfa to parse into.  */
+/* Parse a string S of length LEN into D.  S can include NUL characters.
+   This is the main entry point for the parser.  */
 void
 dfaparse (char const *s, size_t len, struct dfa *d)
 {
@@ -3741,7 +3740,9 @@ dfassbuild (struct dfa *d)
     }
 }
 
-/* Parse and analyze a single string of the given length.  */
+/* Parse a string S of length LEN into D (but skip this step if S is null).
+   Then analyze D and build a matcher for it.
+   SEARCHFLAG says whether to build a searching or an exact matcher.  */
 void
 dfacomp (char const *s, size_t len, struct dfa *d, bool searchflag)
 {
diff --git a/lib/dfa.h b/lib/dfa.h
index 221f7d172..bf87703e8 100644
--- a/lib/dfa.h
+++ b/lib/dfa.h
@@ -65,18 +65,20 @@ enum
 extern void dfasyntax (struct dfa *, struct localeinfo const *,
                        reg_syntax_t, int);
 
-/* Build and return the struct dfamust from the given struct dfa. */
+/* Parse the given string of given length into the given struct dfa.  */
+extern void dfaparse (char const *, size_t, struct dfa *);
+
+/* Allocate and return a struct dfamust from a struct dfa that was
+   initialized by dfaparse and not yet given to dfacomp.  */
 extern struct dfamust *dfamust (struct dfa const *);
 
 /* Free the storage held by the components of a struct dfamust. */
 extern void dfamustfree (struct dfamust *);
 
-/* Parse the given string of given length into the given struct dfa.  */
-extern void dfaparse (char const *, size_t, struct dfa *);
-
 /* Compile the given string of the given length into the given struct dfa.
-   Final argument is a flag specifying whether to build a searching or an
-   exact matcher. */
+   The last argument says whether to build a searching or an exact matcher.
+   A null first argument means the struct dfa has already been
+   initialized by dfaparse; the second argument is ignored.  */
 extern void dfacomp (char const *, size_t, struct dfa *, bool);
 
 /* Search through a buffer looking for a match to the given struct dfa.
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-11 23:25   ` Paul Eggert
@ 2019-12-12  7:23     ` arnold
  2019-12-12  7:31     ` arnold
  1 sibling, 0 replies; 26+ messages in thread
From: arnold @ 2019-12-12  7:23 UTC (permalink / raw)
  To: noritnk, eggert, 34951; +Cc: arnold, bug-gnulib

Hi Paul.

Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 3/22/19 7:49 PM, Norihiro Tanaka wrote:
> > Missing a patch for dfa.  Re-send correct patch file.
>
> Thanks, I installed the DFA-relevant parts of your proposed fix into 
> Gnulib. (The grep parts still need doing.) I also installed the attached 
> commentary followup.
>
> While I was at it I installed a patch to fix an unlikely integer 
> overflow that I noticed while reviewing your fix. I also installed some 
> internal changes to prefer signed to unsigned integers for indexes, as 
> this should make future integer overflows easier to catch. See:
>
> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00058.html
> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00059.html

I am reviewing these. In general using signed integers internally
looks OK to me.

> I'd also like to change dfa.h's API to prefer ptrdiff_t to size_t, for 
> the same integer-overflow reason. This would be a (minor) API change so 
> I thought I'd ask first. Any objections?

Yes. I object. Strongly.

We're passing length and count values and those are supposed
to be size_t.  If you REALLY want signed values, then I could
live with ssize_t (as returned by read(2), for example), but I
would find ptrdiff_t to be ugly and unintuitive.

> PS. Arnold, the above discusses all the changes I know about for dfa.c 
> and dfa.h. The proposed API change (size_t->ptrdiff_t) could be 
> installed either before or after the next Gawk release.

Thanks. I'm skimming the other changes now.

Arnold


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-11 23:25   ` Paul Eggert
  2019-12-12  7:23     ` arnold
@ 2019-12-12  7:31     ` arnold
  2019-12-12  7:47       ` arnold
  2019-12-12 22:26       ` Paul Eggert
  1 sibling, 2 replies; 26+ messages in thread
From: arnold @ 2019-12-12  7:31 UTC (permalink / raw)
  To: noritnk, eggert, 34951; +Cc: arnold, bug-gnulib

Hi Paul.

Paul Eggert <eggert@cs.ucla.edu> wrote:

> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00058.html
> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00059.html

Looking at this:

| @@ -1733,11 +1733,11 @@ add_utf8_anychar (struct dfa *dfa)
|      /* f0-f7: 4-byte sequence.  */
|      CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xff0000)
|    };
| -  const unsigned int n = sizeof (utf8_classes) / sizeof (utf8_classes[0]);
| +  int n = sizeof utf8_classes / sizeof *utf8_classes;

Why are you throwing away const here?

Other than this, I think internally too, I'd prefer that you

	1,$s/ptrdiff_t/ssize_t/g

(and fix any printf calls).  It just feels like an abuse of
the type, which is for representing differences between pointers,
and not regular large signed integeers.

However, I'm not going to insist about it internally, whereas
I would object strongly to the use of ptrdiff_t in the API.

Thanks!

Arnold


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-12  7:31     ` arnold
@ 2019-12-12  7:47       ` arnold
  2019-12-12 22:26       ` Paul Eggert
  1 sibling, 0 replies; 26+ messages in thread
From: arnold @ 2019-12-12  7:47 UTC (permalink / raw)
  To: noritnk, eggert, arnold, 34951; +Cc: arnold, bug-gnulib

arnold@skeeve.com wrote:

> Other than this, I think internally too, I'd prefer that you
>
> 	1,$s/ptrdiff_t/ssize_t/g

I did this, just to see. gawk passes its test suite, both in
64- and 32-bit mode.

FWIW.

Thanks,

Arnold


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-12  7:31     ` arnold
  2019-12-12  7:47       ` arnold
@ 2019-12-12 22:26       ` Paul Eggert
  2019-12-13  8:09         ` arnold
  1 sibling, 1 reply; 26+ messages in thread
From: Paul Eggert @ 2019-12-12 22:26 UTC (permalink / raw)
  To: arnold, noritnk, 34951; +Cc: bug-gnulib

On 12/11/19 11:31 PM, arnold@skeeve.com wrote:

> 	1,$s/ptrdiff_t/ssize_t/g

ssize_t can be narrower than ptrdiff_t, so it's not a good type to use 
for this notion. Its original motivation was "the type that 'read' 
returns", and on systems where 'read' can return at most INT_MAX, 
ssize_t can be 32 bits even if size_t is 64 bits.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-12 22:26       ` Paul Eggert
@ 2019-12-13  8:09         ` arnold
  2019-12-13 12:08           ` arnold
  0 siblings, 1 reply; 26+ messages in thread
From: arnold @ 2019-12-13  8:09 UTC (permalink / raw)
  To: noritnk, eggert, arnold, 34951; +Cc: bug-gnulib

Hi Paul.

Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 12/11/19 11:31 PM, arnold@skeeve.com wrote:
>
> > 	1,$s/ptrdiff_t/ssize_t/g
>
> ssize_t can be narrower than ptrdiff_t, so it's not a good type to use 
> for this notion. Its original motivation was "the type that 'read' 
> returns", and on systems where 'read' can return at most INT_MAX, 
> ssize_t can be 32 bits even if size_t is 64 bits.

In practice, how many system are there where ssize_t is 32 bits and size_t
is 64? If that number is <= 5 then I wouldn't worry about using ssize_t.

In any case, as I said, I can live with ptrdiff_t in the implementation,
even though I don't like it that much.  (A nice block comment at the
top of dfa.c explaining why ptrdiff_t is used would be appropriate.)

But I really don't want ptrdiff_t in the API.

Thanks,

Arnold

Thanks,

Arnold


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-13  8:09         ` arnold
@ 2019-12-13 12:08           ` arnold
  2019-12-13 17:53             ` Jim Meyering
  0 siblings, 1 reply; 26+ messages in thread
From: arnold @ 2019-12-13 12:08 UTC (permalink / raw)
  To: noritnk, eggert, arnold, 34951; +Cc: bug-gnulib, jim

arnold@skeeve.com wrote:

> But I really don't want ptrdiff_t in the API.

I see that Paul has made the change to the API over my objections.

Jim --- do you have an opinion on this?

Thanks,

Arnold


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-13 12:08           ` arnold
@ 2019-12-13 17:53             ` Jim Meyering
  2019-12-13 20:00               ` Paul Eggert
  0 siblings, 1 reply; 26+ messages in thread
From: Jim Meyering @ 2019-12-13 17:53 UTC (permalink / raw)
  To: Aharon Robbins
  Cc: Paul Eggert, bug-gnulib@gnu.org List, Norihiro Tanaka, 34951

On Fri, Dec 13, 2019 at 4:08 AM <arnold@skeeve.com> wrote:
> arnold@skeeve.com write:
>
> > But I really don't want ptrdiff_t in the API.
>
> I see that Paul has made the change to the API over my objections.
>
> Jim --- do you have an opinion on this?

Hi Aharon,

I used to feel the way you do. However, given the way compilers and
static/dynamic analysis have evolved, I have come around to Paul's
point of view. It still feels "wrong" in some sense, but using the
signed type makes the code more robust, enabling automatic
detection/avoidance of more bugs than with unsigned types. Thus, a net
improvement.

Paul, can you point to a link that lists the benefits/tradeoffs? If I
had such a link handy, I would have provided it here.

Jim


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-13 17:53             ` Jim Meyering
@ 2019-12-13 20:00               ` Paul Eggert
  2019-12-14  2:35                 ` intptr_t vs. uintptr_t Bruno Haible
  2019-12-15  8:14                 ` bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher arnold
  0 siblings, 2 replies; 26+ messages in thread
From: Paul Eggert @ 2019-12-13 20:00 UTC (permalink / raw)
  To: Jim Meyering, Aharon Robbins; +Cc: 34951, Gnulib bugs, Norihiro Tanaka

>> I see that Paul has made the change to the API over my objections.

I made the change while responding to Bruno's objections, but before 
seeing yours. Ooops. Sorry about that. However, I hope the followup 
emails have addressed your comments, at least to some extent.

> Paul, can you point to a link that lists the benefits/tradeoffs? If I
> had such a link handy, I would have provided it here.

Avoiding unsigned types for indexes and sizes seems to be a growing 
movement. Admittedly there are arguments for unsigned, but these 
arguments are getting weaker with time. Here are a couple of links, the 
first for C and the second for C++:

https://www.gnu.org/software/emacs/manual/html_node/elisp/C-Integer-Types.html

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf

As for ssize_t vs ptrdiff_t: ssize_t is less central to the C language 
(ptrdiff_t is in the C standard but ssize_t is not). And ssize_t is less 
convenient: for example, there's no simple, portable way to printf an 
ssize_t value, as there is with "%td" and ptrdiff_t. So there are 
technical reasons for preferring ptrdiff_t to ssize_t for this sort of 
thing (even though "ssize_t" is a shorter and better name). Thich is why 
Emacs, other parts of Gnulib, and other Gnu applications have used 
ptrdiff_t instead of ssize_t for this sort of thing.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: intptr_t vs. uintptr_t
  2019-12-13 20:00               ` Paul Eggert
@ 2019-12-14  2:35                 ` Bruno Haible
  2019-12-14  3:19                   ` Paul Eggert
  2019-12-15  8:14                 ` bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher arnold
  1 sibling, 1 reply; 26+ messages in thread
From: Bruno Haible @ 2019-12-14  2:35 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Paul Eggert

Paul Eggert wrote:
> https://www.gnu.org/software/emacs/manual/html_node/elisp/C-Integer-Types.html

Quoting it:
  "Prefer intptr_t for internal representations of pointers"

I disagree with this advice. uintptr_t ought to be used for representing the
address of a pointer.

Why? Because when signed comparisons or pointer differences come into play,
  - uintptr_t creates a boundary line at 0x00000000,
  - intptr_t creates a boundary line at  0x80000000.
Now look at the virtual memory map of a process (e.g. by compiling vma-iter.c
with -DTEST).

On all OSes, there is a natural boundary line at 0x00000000 - simply because
there is the null-pointer catching area there.

On many OSes, memory allocations can lie around 0x80000000.

So, it is possible to have ptr1 = 0x7fffc000 and ptr2 = 0x80003000 point into
the same object (allocated through mmap or malloc). Then
  - you DO want ptr1 < ptr2 to evaluate to true, not false,
  - you DO want ptr2 - ptr1 to evaluate to 0x7000, not to a signed integer
    overflow.

Bruno



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: intptr_t vs. uintptr_t
  2019-12-14  2:35                 ` intptr_t vs. uintptr_t Bruno Haible
@ 2019-12-14  3:19                   ` Paul Eggert
  2019-12-14  9:14                     ` Bruno Haible
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Eggert @ 2019-12-14  3:19 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib

On 12/13/19 6:35 PM, Bruno Haible wrote:
>   "Prefer intptr_t for internal representations of pointers"
> 
> I disagree with this advice. uintptr_t ought to be used for representing the
> address of a pointer.

It depends on the application. For example, with two char * pointers P and Q
into an array, it can be helpful that P - Q yields the same integer as
((intptr_t) P - (intptr_t) Q), assuming the usual representation. That's not
true for uintptr_t.

In practice, Emacs uses uintptr_t quite a bit for things like hashes and tags;
but it uses intptr_t a bit more, so the advice seems reasonable for Emacs.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: intptr_t vs. uintptr_t
  2019-12-14  3:19                   ` Paul Eggert
@ 2019-12-14  9:14                     ` Bruno Haible
  2019-12-14 22:29                       ` Paul Eggert
  0 siblings, 1 reply; 26+ messages in thread
From: Bruno Haible @ 2019-12-14  9:14 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bug-gnulib

Hi Paul,

> >   "Prefer intptr_t for internal representations of pointers"
> > 
> > I disagree with this advice. uintptr_t ought to be used for representing the
> > address of a pointer.
> 
> It depends on the application. For example, with two char * pointers P and Q
> into an array, it can be helpful that P - Q yields the same integer as
> ((intptr_t) P - (intptr_t) Q), assuming the usual representation.

Suppose that we have an array that extends from 0x7fff8000 to 0x80003fff - this
can happen! look at the address range maps -, and
- either P = 0x7fffc000 and Q = 0x80003000
  then ((intptr_t) P - (intptr_t) Q) causes a signed integer overflow
  (value > INTPTR_MAX),
- or vice versa: P = 0x80003000 and Q = 0x7fffc000
  then ((intptr_t) P - (intptr_t) Q) causes a signed integer overflow
  (value < INTPTR_MIN).
So, with an undefined-behaviour sanitizer enabled, the program will crash.

> That's not true for uintptr_t.

With uintptr_t, the program will not crash for an address difference computation.
The expression (intptr_t)((uintptr_t) P - (uintptr_t) Q) will always return
the expected value. (Here we assume that no array consumes more than half of
the address range, a constraint that is also your justification for using
ptrdiff_t.)

> In practice, Emacs uses uintptr_t quite a bit for things like hashes and tags;
> but it uses intptr_t a bit more, so the advice seems reasonable for Emacs.

No, the advice is bad, even for Emacs.

Bruno



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: intptr_t vs. uintptr_t
  2019-12-14  9:14                     ` Bruno Haible
@ 2019-12-14 22:29                       ` Paul Eggert
  2019-12-15  0:35                         ` Bruno Haible
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Eggert @ 2019-12-14 22:29 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 583 bytes --]

On 12/14/19 1:14 AM, Bruno Haible wrote:
> Suppose that we have an array that extends from 0x7fff8000 to 0x80003fff

Ah, I hadn't thought about that. Thanks for mentioning it.

With Emacs's use of intptr_t this should not be an issue, since Emacs either
does no arithmetic on intptr_t values, or does only minor arithmetic (typically
pointer tagging) that keeps the address on the same page. However, you're right
that uintptr_t is preferable in cases that might cross page boundaries, and I
installed the attached into the Emacs Lisp reference manual to try to capture
that advice.

[-- Attachment #2: 0001-Adjust-intptr_t-advice.patch --]
[-- Type: text/x-patch, Size: 1505 bytes --]

From 67adb673e799b394eab346e44a08b63daa0412ae Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sat, 14 Dec 2019 14:22:03 -0800
Subject: [PATCH] Adjust intptr_t advice

* doc/lispref/internals.texi (C Integer Types): Say to prefer
uintptr_t when pointer arithmetic might overflow intptr_t.
---
 doc/lispref/internals.texi | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/doc/lispref/internals.texi b/doc/lispref/internals.texi
index 8c55f4ea37..2a4e64dbb5 100644
--- a/doc/lispref/internals.texi
+++ b/doc/lispref/internals.texi
@@ -2825,12 +2825,14 @@ C Integer Types
 @code{SSIZE_MAX}.
 
 @item
-Prefer @code{intptr_t} for internal representations of pointers, or
+Normally, prefer @code{intptr_t} for internal representations of pointers, or
 for integers bounded only by the number of objects that can exist at
 any given time or by the total number of bytes that can be allocated.
-Currently Emacs sometimes uses other types when @code{intptr_t} would
-be better; fixing this is lower priority, as the code works as-is on
-Emacs's current porting targets.
+However, prefer @code{uintptr_t} to represent pointer arithmetic that
+could cross page boundaries.  For example, on a machine with a 32-bit
+address space an array could cross the 0x7fffffff/0x80000000 boundary,
+which would cause an integer overflow when adding 1 to
+@code{(intptr_t) 0x7fffffff}.
 
 @item
 Prefer the Emacs-defined type @code{EMACS_INT} for representing values
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: intptr_t vs. uintptr_t
  2019-12-14 22:29                       ` Paul Eggert
@ 2019-12-15  0:35                         ` Bruno Haible
  2019-12-16 10:02                           ` Paul Eggert
  0 siblings, 1 reply; 26+ messages in thread
From: Bruno Haible @ 2019-12-15  0:35 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bug-gnulib

Hi Paul,

> I installed the attached into the Emacs Lisp reference manual to try to capture
> that advice.

Thanks. This is better. But the advice can be simplified: If pointer arithmetic
is involved, uintptr_t is better suited than intptr_t. And if pointer arithmetic
is not involved, uintptr_t and intptr_t are equivalent and equally good. So,
a simpler advice would be to always prefer uintptr_t (to represent pointer
values).

Bruno



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-13 20:00               ` Paul Eggert
  2019-12-14  2:35                 ` intptr_t vs. uintptr_t Bruno Haible
@ 2019-12-15  8:14                 ` arnold
  2019-12-16  9:56                   ` Paul Eggert
  1 sibling, 1 reply; 26+ messages in thread
From: arnold @ 2019-12-15  8:14 UTC (permalink / raw)
  To: jim, eggert, arnold; +Cc: 34951, bug-gnulib, noritnk

OK. I skimmed the links.  But why not write the code to say what
we mean?  For example:

	#include <stdint.h>
	typedef int64_t dfa_size_t;

	extern void dfaparse (char const *, dfa_size_t, struct dfa *);
	extern void dfacomp (char const *, dfa_size_t, struct dfa *, bool);
			      bool allow_nl, dfa_size_t *count, bool *backref);

Using ptrdiff_t directly simply because it is defined to be the
largest signed integer remains ugly (and Paul has already moved to
a typedef in the implementation.)

int64_t is just as standard as ptrdiff_t and just as clear.

Thanks,

Arnold

Paul Eggert <eggert@cs.ucla.edu> wrote:

> >> I see that Paul has made the change to the API over my objections.
>
> I made the change while responding to Bruno's objections, but before 
> seeing yours. Ooops. Sorry about that. However, I hope the followup 
> emails have addressed your comments, at least to some extent.
>
> > Paul, can you point to a link that lists the benefits/tradeoffs? If I
> > had such a link handy, I would have provided it here.
>
> Avoiding unsigned types for indexes and sizes seems to be a growing 
> movement. Admittedly there are arguments for unsigned, but these 
> arguments are getting weaker with time. Here are a couple of links, the 
> first for C and the second for C++:
>
> https://www.gnu.org/software/emacs/manual/html_node/elisp/C-Integer-Types.html
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf
>
> As for ssize_t vs ptrdiff_t: ssize_t is less central to the C language 
> (ptrdiff_t is in the C standard but ssize_t is not). And ssize_t is less 
> convenient: for example, there's no simple, portable way to printf an 
> ssize_t value, as there is with "%td" and ptrdiff_t. So there are 
> technical reasons for preferring ptrdiff_t to ssize_t for this sort of 
> thing (even though "ssize_t" is a shorter and better name). Thich is why 
> Emacs, other parts of Gnulib, and other Gnu applications have used 
> ptrdiff_t instead of ssize_t for this sort of thing.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-15  8:14                 ` bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher arnold
@ 2019-12-16  9:56                   ` Paul Eggert
  2019-12-16 10:12                     ` arnold
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Eggert @ 2019-12-16  9:56 UTC (permalink / raw)
  To: arnold, jim; +Cc: 34951, bug-gnulib, noritnk

On 12/15/19 12:14 AM, arnold@skeeve.com wrote:

> int64_t is just as standard as ptrdiff_t and just as clear.

Actually, int64_t is optional (as even C18 and POSIX-2018 do not require it),
whereas ptrdiff_t has been required since C89. More importantly, int64_t would
be overkill on 32-bit GNU/Linux, whereas ptrdiff_t suffices and is typically
more efficient.

(Besides, what would we do if 72-bit pointers came back into vogue? :-)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: intptr_t vs. uintptr_t
  2019-12-15  0:35                         ` Bruno Haible
@ 2019-12-16 10:02                           ` Paul Eggert
  0 siblings, 0 replies; 26+ messages in thread
From: Paul Eggert @ 2019-12-16 10:02 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib

On 12/14/19 4:35 PM, Bruno Haible wrote:
> the advice can be simplified: If pointer arithmetic
> is involved, uintptr_t is better suited than intptr_t. And if pointer arithmetic
> is not involved, uintptr_t and intptr_t are equivalent and equally good.

It's more complicated in Emacs, because Emacs sometimes converts small integers
to pointers and then back again, and these integers can be negative. (The C
standard doesn't guarantee that this works, but Emacs is deliberately
nonportable in this low-level area and it does work on Emacs's current
platforms.) For such conversions, signed integers are more convenient.

It is a messy area, admittedly.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-16  9:56                   ` Paul Eggert
@ 2019-12-16 10:12                     ` arnold
  2019-12-20  3:18                       ` Paul Eggert
  0 siblings, 1 reply; 26+ messages in thread
From: arnold @ 2019-12-16 10:12 UTC (permalink / raw)
  To: jim, eggert, arnold; +Cc: 34951, bug-gnulib, noritnk

Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 12/15/19 12:14 AM, arnold@skeeve.com wrote:
>
> > int64_t is just as standard as ptrdiff_t and just as clear.
>
> Actually, int64_t is optional (as even C18 and POSIX-2018 do not require it),
> whereas ptrdiff_t has been required since C89. More importantly, int64_t would
> be overkill on 32-bit GNU/Linux, whereas ptrdiff_t suffices and is typically
> more efficient.
>
> (Besides, what would we do if 72-bit pointers came back into vogue? :-)

Fine.

What about

	typedef ptrdiff_t dfa_size_t

?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-16 10:12                     ` arnold
@ 2019-12-20  3:18                       ` Paul Eggert
  2019-12-20 10:35                         ` arnold
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Eggert @ 2019-12-20  3:18 UTC (permalink / raw)
  To: arnold, jim; +Cc: 34951, bug-gnulib, noritnk

On 12/16/19 2:12 AM, arnold@skeeve.com wrote:
> What about
> 
> 	typedef ptrdiff_t dfa_size_t

That declaration would imply that the type is specific to DFAs. However, the
type is used (with exactly the same meaning) in a lot of other places. This is
why I used the more-generic name "idx_t" internally dfa.c.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher
  2019-12-20  3:18                       ` Paul Eggert
@ 2019-12-20 10:35                         ` arnold
  0 siblings, 0 replies; 26+ messages in thread
From: arnold @ 2019-12-20 10:35 UTC (permalink / raw)
  To: jim, eggert, arnold; +Cc: 34951, bug-gnulib, noritnk

Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 12/16/19 2:12 AM, arnold@skeeve.com wrote:
> > What about
> > 
> > 	typedef ptrdiff_t dfa_size_t
>
> That declaration would imply that the type is specific to DFAs. However, the
> type is used (with exactly the same meaning) in a lot of other places. This is
> why I used the more-generic name "idx_t" internally dfa.c.

I give up. Leave it ptrdiff_t.  I may submit comment changes for dfa.h
later.

Arnold


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-12-20 10:35 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-22 23:06 [PATCH] grep: a kwset matcher not work in a grep matcher Norihiro Tanaka
2019-03-23  2:49 ` bug#34951: " Norihiro Tanaka
2019-03-23  2:58   ` Budi
2019-03-23  2:59     ` Budi
2019-03-23 12:39       ` Eric Blake
2019-03-29 10:58   ` arnold
2019-12-11 23:25   ` Paul Eggert
2019-12-12  7:23     ` arnold
2019-12-12  7:31     ` arnold
2019-12-12  7:47       ` arnold
2019-12-12 22:26       ` Paul Eggert
2019-12-13  8:09         ` arnold
2019-12-13 12:08           ` arnold
2019-12-13 17:53             ` Jim Meyering
2019-12-13 20:00               ` Paul Eggert
2019-12-14  2:35                 ` intptr_t vs. uintptr_t Bruno Haible
2019-12-14  3:19                   ` Paul Eggert
2019-12-14  9:14                     ` Bruno Haible
2019-12-14 22:29                       ` Paul Eggert
2019-12-15  0:35                         ` Bruno Haible
2019-12-16 10:02                           ` Paul Eggert
2019-12-15  8:14                 ` bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher arnold
2019-12-16  9:56                   ` Paul Eggert
2019-12-16 10:12                     ` arnold
2019-12-20  3:18                       ` Paul Eggert
2019-12-20 10:35                         ` arnold

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).