From: Carlos O'Donell <carlos@redhat.com>
To: Zack Weinberg <zackw@panix.com>, libc-alpha@sourceware.org
Subject: Re: [PATCH v4] Use a proper C tokenizer to implement the obsolete typedefs test.
Date: Mon, 11 Mar 2019 14:57:52 -0400
Message-ID: <36fa3d03-0dc4-bbd7-5c99-c8fe19255b8e@redhat.com>
In-Reply-To: <20190311145927.20399-1-zackw@panix.com>

On 3/11/19 10:59 AM, Zack Weinberg wrote:
> The test for obsolete typedefs in installed headers was implemented
> using grep, and could therefore get false positives on e.g. “ulong”
> in a comment.  It was also scanning all of the headers included by
> our headers, and therefore testing headers we don’t control, e.g.
> Linux kernel headers.

Correct.
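
Just to spell out the failure mode with a quick Python sketch (this is
only the idea, not the patch's code): a grep-style search over the raw
header text flags "ulong" inside a comment, while stripping comments
and strings first, which is roughly what a real tokenizer buys us,
does not.

    import re

    HEADER_TEXT = (
        '/* Legacy code often spelled this type "ulong".  */\n'
        'typedef unsigned long int my_counter_t;\n')

    # Simplified form of the old obsolete_type_re in
    # check-installed-headers.sh.
    OBSOLETE = re.compile(
        r'\b(?:__)?(?:quad_t|u(?:short|int|long'
        r'|_(?:char|short|int|long|quad_t)))\b')

    # grep equivalent: searches the raw text, comments included.
    print(bool(OBSOLETE.search(HEADER_TEXT)))      # True, a false positive

    # Token-aware equivalent: drop comments and strings before searching.
    NON_CODE = re.compile(r'/\*.*?\*/|//[^\n]*|"(?:[^"\\\n]|\\.)*"', re.DOTALL)
    print(bool(OBSOLETE.search(NON_CODE.sub(" ", HEADER_TEXT))))   # False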

> This patch splits the obsolete-typedef test from
> scripts/check-installed-headers.sh to a separate program,
> scripts/check-obsolete-constructs.py.  Being implemented in Python,
> it is feasible to make it tokenize C accurately enough to avoid false
> positives on the contents of comments and strings.  It also only
> examines $(headers) in each subdirectory--all the headers we install,
> but not any external dependencies of those headers.  Headers whose
> installed name starts with finclude/ are ignored, on the assumption
> that they contain Fortran.

OK.

> It is also feasible to make the new test understand the difference
> between _defining_ the obsolete typedefs and _using_ the obsolete
> typedefs, which means posix/{bits,sys}/types.h no longer need to be
> exempted.  This uncovered an actual bug in bits/types.h: __quad_t and
> __u_quad_t were being used to define __S64_TYPE, __U64_TYPE,
> __SQUAD_TYPE and __UQUAD_TYPE.  These are changed to __int64_t and
> __uint64_t respectively.  This is a safe change, despite the comments
> in bits/types.h claiming a difference between __quad_t and __int64_t,
> because those comments are incorrect.  In all current ABIs, both
> __quad_t and __int64_t are ‘long’ when ‘long’ is a 64-bit type, and
> ‘long long’ when ‘long’ is a 32-bit type, and similarly for __u_quad_t
> and __uint64_t.  (Changing the types to be what the comments say they
> are would be an ABI break, as it affects C++ name mangling.)  This
> patch includes a minimal change to make the comments not completely
> wrong.  I plan to remove __SQUAD_TYPE and __UQUAD_TYPE altogether in
> subsequent patches, but that would be inappropriate for backporting to
> release branches.

OK.
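
The define-versus-use distinction is the valuable part here.  As a
simplified sketch (not the checker in the patch, just the rule it
enforces): within a typedef, an obsolete name is acceptable only as
the name being declared, i.e. the last identifier before the ';';
anywhere else it counts as a use and is flagged.

    import re

    OBSOLETE = re.compile(r'\A(?:__)?(?:quad_t|u_quad_t|u(?:short|int|long))\Z')
    IDENT = re.compile(r'[_A-Za-z][_A-Za-z0-9]*')

    def obsolete_uses(line):
        idents = IDENT.findall(line)
        if idents and idents[0] == "typedef":
            idents = idents[1:-1]    # the declared name itself may be obsolete
        return [i for i in idents if OBSOLETE.match(i)]

    print(obsolete_uses("typedef __int64_t __quad_t;"))    # [] (a definition)
    print(obsolete_uses("#define __S64_TYPE __quad_t"))    # ['__quad_t'] (a use)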

> sys/types.h was defining the legacy BSD u_intN_t typedefs using a
> construct that was not necessarily consistent with how the C99 uintN_t
> typedefs are defined, and is also too complicated for the new script to
> understand (it lexes C relatively accurately, but it does not attempt
> to expand preprocessor macros, nor does it do any actual parsing).
> This patch cuts all of that out and uses bits/types.h's __uintN_t typedefs
> to define u_intN_t instead.  This is verified to not change the ABI on
> any supported architecture, via the c++-types test, which means u_intN_t
> and uintN_t were, in fact, consistent on all supported architectures.
> I plan to restrict u_intN_t and some other legacy typedefs (but not
> intN_t) to __USE_MISC in subsequent patches, but again that would be
> inappropriate for backporting to release branches.

OK.

> 	* scripts/check-obsolete-constructs.py: New test script.
>         * scripts/check-installed-headers.sh: Remove tests for
>         obsolete typedefs, superseded by check-obsolete-constructs.py.
>         * Rules: Run scripts/check-obsolete-constructs.py over $(headers)
>         as a special test.  Update commentary.
>         * posix/bits/types.h (__SQUAD_TYPE, __S64_TYPE): Define as __int64_t.
>         (__UQUAD_TYPE, __U64_TYPE): Define as __uint64_t.
>         Update commentary.
>         * posix/sys/types.h (__u_intN_t): Remove.
>         (u_int8_t): Typedef using __uint8_t.
>         (u_int16_t): Typedef using __uint16_t.
>         (u_int32_t): Typedef using __uint32_t.
>         (u_int64_t): Typedef using __uint64_t.

OK for master if you:
- Fix the conditional in check-installed-headers.sh.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> ---
> 
> Changes since v3: Fortran headers are now detected by path
> (finclude/*) instead of looking for Emacs mode annotations within the
> file.  tokenize_c is now responsible for issuing errors for the BAD_*
> and OTHER lexical productions, and no longer returns BAD_* tokens.
> It is also responsible for tracking whether or not each token belongs
> to a preprocessing directive line, and accurately tokenizes #include
> arguments.  (Accurate tokenization of #include arguments will be
> required by future patches I have planned.)

OK.
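
The directive tracking is easy to see interactively.  A quick way to
poke at the new tokenizer (a sketch with some assumptions: the script
is saved as scripts/check-obsolete-constructs.py, and since it calls
main() at the bottom we strip that call before exec'ing it so the
helpers can be used directly):

    src = open("scripts/check-obsolete-constructs.py").read()
    ns = {}
    exec(src.replace("\nmain()\n", "\n"), ns)   # skip the command-line entry point

    class Reporter:
        def error(self, tok, message):
            print("{}:{}: {}".format(tok.line, tok.column, message.format(tok.text)))

    sample = "#include <sys/types.h>\nint x; /* u_long only in a comment */\n"
    for tok in ns["tokenize_c"](sample, Reporter()):
        if tok.kind != "WHITESPACE":
            print(tok.kind, tok.text, tok.context)

    # Prints (abbreviated):
    #   PUNCTUATOR # <null>
    #   IDENT include include
    #   HEADER_NAME <sys/types.h> include
    #   IDENT int None
    #   ...
    #   BLOCK_COMMENT /* u_long only in a comment */ None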

> ---
>  Rules                                |  17 +-
>  posix/bits/types.h                   |  10 +-
>  posix/sys/types.h                    |  33 +-
>  scripts/check-installed-headers.sh   |  37 +--
>  scripts/check-obsolete-constructs.py | 466 +++++++++++++++++++++++++++
>  5 files changed, 500 insertions(+), 63 deletions(-)
>  create mode 100755 scripts/check-obsolete-constructs.py
> 
> diff --git a/Rules b/Rules
> index e08a28d9f3..222dba6dcb 100644
> --- a/Rules
> +++ b/Rules
> @@ -82,7 +82,8 @@ $(common-objpfx)dummy.c:
>  common-generated += dummy.o dummy.c
>  
>  ifneq "$(headers)" ""
> -# Special test of all the installed headers in this directory.
> +# Test that all of the headers installed by this directory can be compiled
> +# in isolation.

OK.

>  tests-special += $(objpfx)check-installed-headers-c.out
>  libof-check-installed-headers-c := testsuite
>  $(objpfx)check-installed-headers-c.out: \
> @@ -93,6 +94,8 @@ $(objpfx)check-installed-headers-c.out: \
>  	$(evaluate-test)
>  
>  ifneq "$(CXX)" ""
> +# If a C++ compiler is available, also test that they can be compiled
> +# in isolation as C++.

OK.

>  tests-special += $(objpfx)check-installed-headers-cxx.out
>  libof-check-installed-headers-cxx := testsuite
>  $(objpfx)check-installed-headers-cxx.out: \
> @@ -103,12 +106,24 @@ $(objpfx)check-installed-headers-cxx.out: \
>  	$(evaluate-test)
>  endif # $(CXX)
>  
> +# Test that a wrapper header exists in include/ for each non-sysdeps header.
> +# This script does not need $(py-env).

OK.

>  tests-special += $(objpfx)check-wrapper-headers.out
>  $(objpfx)check-wrapper-headers.out: \
>    $(..)scripts/check-wrapper-headers.py $(headers)
>  	$(PYTHON) $< --root=$(..) --subdir=$(subdir) $(headers) > $@; \
>  	  $(evaluate-test)
>  
> +# Test that none of the headers installed by this directory use certain
> +# obsolete constructs (e.g. legacy BSD typedefs superseded by stdint.h).
> +# This script does not need $(py-env).
> +tests-special += $(objpfx)check-obsolete-constructs.out
> +libof-check-obsolete-constructs := testsuite
> +$(objpfx)check-obsolete-constructs.out: \
> +    $(..)scripts/check-obsolete-constructs.py $(headers)
> +	$(PYTHON) $^ > $@ 2>&1; \
> +	$(evaluate-test)

OK.

> +
>  endif # $(headers)
>  \f
>  # This makes all the auxiliary and test programs.
> diff --git a/posix/bits/types.h b/posix/bits/types.h
> index 27e065c3be..0de6c59bb4 100644
> --- a/posix/bits/types.h
> +++ b/posix/bits/types.h
> @@ -87,7 +87,7 @@ __extension__ typedef unsigned long long int __uintmax_t;
>  	32		-- "natural" 32-bit type (always int)
>  	64		-- "natural" 64-bit type (long or long long)
>  	LONG32		-- 32-bit type, traditionally long
> -	QUAD		-- 64-bit type, always long long
> +	QUAD		-- 64-bit type, traditionally long long

OK.

>  	WORD		-- natural type of __WORDSIZE bits (int or long)
>  	LONGWORD	-- type of __WORDSIZE bits, traditionally long
>  
> @@ -113,14 +113,14 @@ __extension__ typedef unsigned long long int __uintmax_t;
>  #define __SLONGWORD_TYPE	long int
>  #define __ULONGWORD_TYPE	unsigned long int
>  #if __WORDSIZE == 32
> -# define __SQUAD_TYPE		__quad_t
> -# define __UQUAD_TYPE		__u_quad_t
> +# define __SQUAD_TYPE		__int64_t
> +# define __UQUAD_TYPE		__uint64_t

OK.

>  # define __SWORD_TYPE		int
>  # define __UWORD_TYPE		unsigned int
>  # define __SLONG32_TYPE		long int
>  # define __ULONG32_TYPE		unsigned long int
> -# define __S64_TYPE		__quad_t
> -# define __U64_TYPE		__u_quad_t
> +# define __S64_TYPE		__int64_t
> +# define __U64_TYPE		__uint64_t

OK.

>  /* We want __extension__ before typedef's that use nonstandard base types
>     such as `long long' in C89 mode.  */
>  # define __STD_TYPE		__extension__ typedef
> diff --git a/posix/sys/types.h b/posix/sys/types.h
> index 27129c5c23..0e37b1ce6a 100644
> --- a/posix/sys/types.h
> +++ b/posix/sys/types.h
> @@ -154,37 +154,20 @@ typedef unsigned int uint;
>  
>  #include <bits/stdint-intn.h>
>  
> -#if !__GNUC_PREREQ (2, 7)
> -
>  /* These were defined by ISO C without the first `_'.  */
> -typedef	unsigned char u_int8_t;
> -typedef	unsigned short int u_int16_t;
> -typedef	unsigned int u_int32_t;
> -# if __WORDSIZE == 64
> -typedef unsigned long int u_int64_t;
> -# else
> -__extension__ typedef unsigned long long int u_int64_t;
> -# endif
> -
> -typedef int register_t;

OK; the register_t definition removed here is covered by the hunk below.

> -
> -#else
> -
> -/* For GCC 2.7 and later, we can use specific type-size attributes.  */
> -# define __u_intN_t(N, MODE) \
> -  typedef unsigned int u_int##N##_t __attribute__ ((__mode__ (MODE)))
> -
> -__u_intN_t (8, __QI__);
> -__u_intN_t (16, __HI__);
> -__u_intN_t (32, __SI__);
> -__u_intN_t (64, __DI__);
> +typedef __uint8_t u_int8_t;
> +typedef __uint16_t u_int16_t;
> +typedef __uint32_t u_int32_t;
> +typedef __uint64_t u_int64_t;

OK, switching to four new typedefs based on __uintN_t that do not change the underlying types.

>  
> +#if __GNUC_PREREQ (2, 7)

OK, the old __GNUC_PREREQ check is retained here.

>  typedef int register_t __attribute__ ((__mode__ (__word__)));
> -
> +#else
> +typedef int register_t;
> +#endif

OK.

>  
>  /* Some code from BIND tests this macro to see if the types above are
>     defined.  */
> -#endif

OK.

>  #define __BIT_TYPES_DEFINED__	1
>  
>  
> diff --git a/scripts/check-installed-headers.sh b/scripts/check-installed-headers.sh
> index 1f4496446c..e4f37d33f8 100644
> --- a/scripts/check-installed-headers.sh
> +++ b/scripts/check-installed-headers.sh
> @@ -16,11 +16,9 @@
>  # License along with the GNU C Library; if not, see
>  # <http://www.gnu.org/licenses/>.
>  
> -# Check installed headers for cleanliness.  For each header, confirm
> -# that it's possible to compile a file that includes that header and
> -# does nothing else, in several different compilation modes.  Also,
> -# scan the header for a set of obsolete typedefs that should no longer
> -# appear.
> +# For each installed header, confirm that it's possible to compile a
> +# file that includes that header and does nothing else, in several
> +# different compilation modes.

OK.

>  
>  # These compilation switches assume GCC or compatible, which is probably
>  # fine since we also assume that when _building_ glibc.
> @@ -31,13 +29,6 @@ cxx_modes="-std=c++98 -std=gnu++98 -std=c++11 -std=gnu++11"
>  # These are probably the most commonly used three.
>  lib_modes="-D_DEFAULT_SOURCE=1 -D_GNU_SOURCE=1 -D_XOPEN_SOURCE=700"
>  
> -# sys/types.h+bits/types.h have to define the obsolete types.
> -# rpc(svc)/* have the obsolete types too deeply embedded in their API
> -# to remove.
> -skip_obsolete_type_check='*/sys/types.h|*/bits/types.h|*/rpc/*|*/rpcsvc/*'
> -obsolete_type_re=\
> -'\<((__)?(quad_t|u(short|int|long|_(char|short|int([0-9]+_t)?|long|quad_t))))\>'
> -

OK.

>  if [ $# -lt 3 ]; then
>      echo "usage: $0 c|c++ \"compile command\" header header header..." >&2
>      exit 2
> @@ -46,14 +37,10 @@ case "$1" in
>      (c)
>          lang_modes="$c_modes"
>          cih_test_c=$(mktemp ${TMPDIR-/tmp}/cih_test_XXXXXX.c)
> -        already="$skip_obsolete_type_check"

OK.

>      ;;
>      (c++)
>          lang_modes="$cxx_modes"
>          cih_test_c=$(mktemp ${TMPDIR-/tmp}/cih_test_XXXXXX.cc)
> -        # The obsolete-type check can be skipped for C++; it is
> -        # sufficient to do it for C.
> -        already="*"

OK.

>      ;;
>      (*)
>          echo "usage: $0 c|c++ \"compile command\" header header header..." >&2
> @@ -155,22 +142,8 @@ $expanded_lib_mode
>  int avoid_empty_translation_unit;
>  EOF
>              if $cc_cmd -fsyntax-only $lang_mode "$cih_test_c" 2>&1
> -            then
> -                includes=$($cc_cmd -fsyntax-only -H $lang_mode \
> -                              "$cih_test_c" 2>&1 | sed -ne 's/^[.][.]* //p')
> -                for h in $includes; do
> -                    # Don't repeat work.
> -                    eval 'case "$h" in ('"$already"') continue;; esac'
> -
> -                    if grep -qE "$obsolete_type_re" "$h"; then
> -                        echo "*** Obsolete types detected:"
> -                        grep -HE "$obsolete_type_re" "$h"
> -                        failed=1
> -                    fi
> -                    already="$already|$h"
> -                done
> -            else
> -                failed=1
> +            then :
> +            else failed=1

Why not 'if ! $cc_cmd ...'?  That would avoid the odd empty then-branch (the lone ':').

>              fi
>          done
>      done
> diff --git a/scripts/check-obsolete-constructs.py b/scripts/check-obsolete-constructs.py
> new file mode 100755
> index 0000000000..46535afcac
> --- /dev/null
> +++ b/scripts/check-obsolete-constructs.py
> @@ -0,0 +1,466 @@
> +#! /usr/bin/python3

OK.

> +# Copyright (C) 2019 Free Software Foundation, Inc.
> +# This file is part of the GNU C Library.
> +#
> +# The GNU C Library is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU Lesser General Public
> +# License as published by the Free Software Foundation; either
> +# version 2.1 of the License, or (at your option) any later version.
> +#
> +# The GNU C Library is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +# Lesser General Public License for more details.
> +#
> +# You should have received a copy of the GNU Lesser General Public
> +# License along with the GNU C Library; if not, see
> +# <http://www.gnu.org/licenses/>.
> +
> +"""Verifies that installed headers do not use any obsolete constructs:
> + * legacy BSD typedefs superseded by <stdint.h>:
> +   ushort uint ulong u_char u_short u_int u_long u_intNN_t quad_t u_quad_t
> +   (sys/types.h is allowed to _define_ these types, but not to use them
> +    to define anything else).
> +"""
> +
> +import argparse
> +import collections
> +import re
> +import sys
> +
> +# Simplified lexical analyzer for C preprocessing tokens.
> +# Does not implement trigraphs.
> +# Does not implement backslash-newline in the middle of any lexical
> +#   item other than a string literal.
> +# Does not implement universal-character-names in identifiers.
> +# Treats prefixed strings (e.g. L"...") as two tokens (L and "...")
> +# Accepts non-ASCII characters only within comments and strings.
> +
> +# Caution: The order of the outermost alternation matters.
> +# STRING must be before BAD_STRING, CHARCONST before BAD_CHARCONST,
> +# BLOCK_COMMENT before BAD_BLOCK_COM before PUNCTUATOR, and OTHER must
> +# be last.
> +# Caution: There should be no capturing groups other than the named
> +# captures in the outermost alternation.
> +
> +# For reference, these are all of the C punctuators as of C11:
> +#   [ ] ( ) { } , ; ? ~
> +#   ! != * *= / /= ^ ^= = ==
> +#   # ##
> +#   % %= %> %: %:%:
> +#   & &= &&
> +#   | |= ||
> +#   + += ++
> +#   - -= -- ->
> +#   . ...
> +#   : :>
> +#   < <% <: << <<= <=
> +#   > >= >> >>=
> +
> +# The BAD_* tokens are not part of the official definition of pp-tokens;
> +# they match unclosed strings, character constants, and block comments,
> +# so that the regex engine doesn't have to backtrack all the way to the
> +# beginning of a broken construct and then emit dozens of junk tokens.
> +
> +PP_TOKEN_RE_ = re.compile(r"""
> +    (?P<STRING>        \"(?:[^\"\\\r\n]|\\(?:[\r\n -~]|\r\n))*\")
> +   |(?P<BAD_STRING>    \"(?:[^\"\\\r\n]|\\[ -~])*)
> +   |(?P<CHARCONST>     \'(?:[^\'\\\r\n]|\\(?:[\r\n -~]|\r\n))*\')
> +   |(?P<BAD_CHARCONST> \'(?:[^\'\\\r\n]|\\[ -~])*)
> +   |(?P<BLOCK_COMMENT> /\*(?:\*(?!/)|[^*])*\*/)
> +   |(?P<BAD_BLOCK_COM> /\*(?:\*(?!/)|[^*])*\*?)
> +   |(?P<LINE_COMMENT>  //[^\r\n]*)
> +   |(?P<IDENT>         [_a-zA-Z][_a-zA-Z0-9]*)
> +   |(?P<PP_NUMBER>     \.?[0-9](?:[0-9a-df-oq-zA-DF-OQ-Z_.]|[eEpP][+-]?)*)
> +   |(?P<PUNCTUATOR>
> +       [,;?~(){}\[\]]
> +     | [!*/^=]=?
> +     | \#\#?
> +     | %(?:[=>]|:(?:%:)?)?
> +     | &[=&]?
> +     |\|[=|]?
> +     |\+[=+]?
> +     | -[=->]?
> +     |\.(?:\.\.)?
> +     | :>?
> +     | <(?:[%:]|<(?:=|<=?)?)?
> +     | >(?:=|>=?)?)
> +   |(?P<ESCNL>         \\(?:\r|\n|\r\n))
> +   |(?P<WHITESPACE>    [ \t\n\r\v\f]+)
> +   |(?P<OTHER>         .)
> +""", re.DOTALL | re.VERBOSE)
> +
> +HEADER_NAME_RE_ = re.compile(r"""
> +    < [^>\r\n]+ >
> +  | " [^"\r\n]+ "
> +""", re.DOTALL | re.VERBOSE)
> +
> +ENDLINE_RE_ = re.compile(r"""\r|\n|\r\n""")
> +
> +# based on the sample code in the Python re documentation
> +Token_ = collections.namedtuple("Token", (
> +    "kind", "text", "line", "column", "context"))
> +Token_.__doc__ = """
> +   One C preprocessing token, comment, or chunk of whitespace.
> +   'kind' identifies the token type, which will be one of:
> +       STRING, CHARCONST, BLOCK_COMMENT, LINE_COMMENT, IDENT,
> +       PP_NUMBER, PUNCTUATOR, ESCNL, WHITESPACE, HEADER_NAME,
> +       or OTHER.  The BAD_* alternatives in PP_TOKEN_RE_ are
> +       handled within tokenize_c, below.
> +
> +   'text' is the sequence of source characters making up the token;
> +       no decoding whatsoever is performed.
> +
> +   'line' and 'column' give the position of the first character of the
> +      token within the source file.  They are both 1-based.
> +
> +   'context' indicates whether or not this token occurred within a
> +      preprocessing directive; it will be None for running text,
> +      '<null>' for the leading '#' of a directive line (because '#'
> +      all by itself on a line is a "null directive"), or the name of
> +      the directive for tokens within a directive line, starting with
> +      the IDENT for the name itself.
> +"""
> +
> +def tokenize_c(file_contents, reporter):
> +    """Yield a series of Token objects, one for each preprocessing
> +       token, comment, or chunk of whitespace within FILE_CONTENTS.
> +       The REPORTER object is expected to have one method,
> +       reporter.error(token, message), which will be called to
> +       indicate a lexical error at the position of TOKEN.
> +       If MESSAGE contains the four-character sequence '{!r}', that
> +       is expected to be replaced by repr(token.text).
> +    """
> +
> +    Token = Token_
> +    PP_TOKEN_RE = PP_TOKEN_RE_
> +    ENDLINE_RE = ENDLINE_RE_
> +    HEADER_NAME_RE = HEADER_NAME_RE_
> +
> +    line_num = 1
> +    line_start = 0
> +    pos = 0
> +    limit = len(file_contents)
> +    directive = None
> +    at_bol = True
> +    while pos < limit:
> +        if directive == "include":
> +            mo = HEADER_NAME_RE.match(file_contents, pos)
> +            if mo:
> +                kind = "HEADER_NAME"
> +                directive = "after_include"
> +            else:
> +                mo = PP_TOKEN_RE.match(file_contents, pos)
> +                kind = mo.lastgroup
> +                if kind != "WHITESPACE":
> +                    directive = "after_include"
> +        else:
> +            mo = PP_TOKEN_RE.match(file_contents, pos)
> +            kind = mo.lastgroup
> +
> +        text = mo.group()
> +        line = line_num
> +        column = mo.start() - line_start
> +        adj_line_start = 0
> +        # only these kinds can contain a newline
> +        if kind in ("WHITESPACE", "BLOCK_COMMENT", "LINE_COMMENT",
> +                    "STRING", "CHARCONST", "BAD_BLOCK_COM", "ESCNL"):
> +            for tmo in ENDLINE_RE.finditer(text):
> +                line_num += 1
> +                adj_line_start = tmo.end()
> +            if adj_line_start:
> +                line_start = mo.start() + adj_line_start
> +
> +        # Track whether or not we are scanning a preprocessing directive.
> +        if kind == "LINE_COMMENT" or (kind == "WHITESPACE" and adj_line_start):
> +            at_bol = True
> +            directive = None
> +        else:
> +            if kind == "PUNCTUATOR" and text == "#" and at_bol:
> +                directive = "<null>"
> +            elif kind == "IDENT" and directive == "<null>":
> +                directive = text
> +            at_bol = False
> +
> +        # Report ill-formed tokens and rewrite them as their well-formed
> +        # equivalents, so downstream processing doesn't have to know about them.
> +        # (Rewriting instead of discarding provides better error recovery.)
> +        if kind == "BAD_BLOCK_COM":
> +            reporter.error(Token("BAD_BLOCK_COM", "", line, column+1, ""),
> +                           "unclosed block comment")
> +            text += "*/"
> +            kind = "BLOCK_COMMENT"
> +        elif kind == "BAD_STRING":
> +            reporter.error(Token("BAD_STRING", "", line, column+1, ""),
> +                           "unclosed string")
> +            text += "\""
> +            kind = "STRING"
> +        elif kind == "BAD_CHARCONST":
> +            reporter.error(Token("BAD_CHARCONST", "", line, column+1, ""),
> +                           "unclosed char constant")
> +            text += "'"
> +            kind = "CHARCONST"
> +
> +        tok = Token(kind, text, line, column+1,
> +                    "include" if directive == "after_include" else directive)
> +        # Do not complain about OTHER tokens inside macro definitions.
> +        # $ and @ appear in macros defined by headers intended to be
> +        # included from assembly language, e.g. sysdeps/mips/sys/asm.h.
> +        if kind == "OTHER" and directive != "define":
> +            reporter.error(tok, "stray {!r} in program")
> +
> +        yield tok
> +        pos = mo.end()
> +
> +#
> +# Base and generic classes for individual checks.
> +#
> +
> +class ConstructChecker:
> +    """Scan a stream of C preprocessing tokens and possibly report
> +       problems with them.  The REPORTER object passed to __init__ has
> +       one method, reporter.error(token, message), which should be
> +       called to indicate a problem detected at the position of TOKEN.
> +       If MESSAGE contains the four-character sequence '{!r}' then that
> +       will be replaced with a textual representation of TOKEN.
> +    """
> +    def __init__(self, reporter):
> +        self.reporter = reporter
> +
> +    def examine(self, tok):
> +        """Called once for each token in a header file.
> +           Call self.reporter.error if a problem is detected.
> +        """
> +        raise NotImplementedError
> +
> +    def eof(self):
> +        """Called once at the end of the stream.  Subclasses need only
> +           override this if it might have something to do."""
> +        pass
> +
> +class NoCheck(ConstructChecker):
> +    """Generic checker class which doesn't do anything.  Substitute this
> +       class for a real checker when a particular check should be skipped
> +       for some file."""
> +
> +    def examine(self, tok):
> +        pass
> +
> +#
> +# Check for obsolete type names.
> +#
> +
> +# The obsolete type names we're looking for:
> +OBSOLETE_TYPE_RE_ = re.compile(r"""\A
> +  (__)?
> +  (   quad_t
> +    | u(?: short | int | long
> +         | _(?: char | short | int(?:[0-9]+_t)? | long | quad_t )))
> +\Z""", re.VERBOSE)
> +
> +class ObsoleteNotAllowed(ConstructChecker):
> +    """Don't allow any use of the obsolete typedefs."""
> +    def examine(self, tok):
> +        if OBSOLETE_TYPE_RE_.match(tok.text):
> +            self.reporter.error(tok, "use of {!r}")
> +
> +class ObsoletePrivateDefinitionsAllowed(ConstructChecker):
> +    """Allow definitions of the private versions of the
> +       obsolete typedefs; that is, 'typedef [anything] __obsolete;'
> +    """
> +    def __init__(self, reporter):
> +        super().__init__(reporter)
> +        self.in_typedef = False
> +        self.prev_token = None
> +
> +    def examine(self, tok):
> +        # bits/types.h hides 'typedef' in a macro sometimes.
> +        if (tok.kind == "IDENT"
> +            and tok.text in ("typedef", "__STD_TYPE")
> +            and tok.context is None):
> +            self.in_typedef = True
> +        elif tok.kind == "PUNCTUATOR" and tok.text == ";" and self.in_typedef:
> +            self.in_typedef = False
> +            if self.prev_token.kind == "IDENT":
> +                m = OBSOLETE_TYPE_RE_.match(self.prev_token.text)
> +                if m and m.group(1) != "__":
> +                    self.reporter.error(self.prev_token, "use of {!r}")
> +            self.prev_token = None
> +        else:
> +            self._check_prev()
> +
> +        self.prev_token = tok
> +
> +    def eof(self):
> +        self._check_prev()
> +
> +    def _check_prev(self):
> +        if (self.prev_token is not None
> +            and self.prev_token.kind == "IDENT"
> +            and OBSOLETE_TYPE_RE_.match(self.prev_token.text)):
> +            self.reporter.error(self.prev_token, "use of {!r}")
> +
> +class ObsoletePublicDefinitionsAllowed(ConstructChecker):
> +    """Allow definitions of the public versions of the obsolete
> +       typedefs.  Only specific forms of definition are allowed:
> +
> +           typedef __obsolete obsolete;  // identifiers must agree
> +           typedef __uintN_t u_intN_t;   // N must agree
> +           typedef unsigned long int ulong;
> +           typedef unsigned short int ushort;
> +           typedef unsigned int uint;
> +    """
> +    def __init__(self, reporter):
> +        super().__init__(reporter)
> +        self.typedef_tokens = []
> +
> +    def examine(self, tok):
> +        if tok.kind in ("WHITESPACE", "BLOCK_COMMENT",
> +                        "LINE_COMMENT", "NL", "ESCNL"):
> +            pass
> +
> +        elif (tok.kind == "IDENT" and tok.text == "typedef"
> +              and tok.context is None):
> +            if self.typedef_tokens:
> +                self.reporter.error(tok, "typedef inside typedef")
> +                self._reset()
> +            self.typedef_tokens.append(tok)
> +
> +        elif tok.kind == "PUNCTUATOR" and tok.text == ";":
> +            self._finish()
> +
> +        elif self.typedef_tokens:
> +            self.typedef_tokens.append(tok)
> +
> +    def eof(self):
> +        self._reset()
> +
> +    def _reset(self):
> +        while self.typedef_tokens:
> +            tok = self.typedef_tokens.pop(0)
> +            if tok.kind == "IDENT" and OBSOLETE_TYPE_RE_.match(tok.text):
> +                self.reporter.error(tok, "use of {!r}")
> +
> +    def _finish(self):
> +        if not self.typedef_tokens: return
> +        if self.typedef_tokens[-1].kind == "IDENT":
> +            m = OBSOLETE_TYPE_RE_.match(self.typedef_tokens[-1].text)
> +            if m:
> +                if self._permissible_public_definition(m):
> +                    self.typedef_tokens.clear()
> +        self._reset()
> +
> +    def _permissible_public_definition(self, m):
> +        if m.group(1) == "__": return False
> +        name = m.group(2)
> +        toks = self.typedef_tokens
> +        ntok = len(toks)
> +        if ntok == 3 and toks[1].kind == "IDENT":
> +            defn = toks[1].text
> +            n = OBSOLETE_TYPE_RE_.match(defn)
> +            if n and n.group(1) == "__" and n.group(2) == name:
> +                return True
> +
> +            if (name[:5] == "u_int" and name[-2:] == "_t"
> +                and defn[:6] == "__uint" and defn[-2:] == "_t"
> +                and name[5:-2] == defn[6:-2]):
> +                return True
> +
> +            return False
> +
> +        if (name == "ulong" and ntok == 5
> +            and toks[1].kind == "IDENT" and toks[1].text == "unsigned"
> +            and toks[2].kind == "IDENT" and toks[2].text == "long"
> +            and toks[3].kind == "IDENT" and toks[3].text == "int"):
> +            return True
> +
> +        if (name == "ushort" and ntok == 5
> +            and toks[1].kind == "IDENT" and toks[1].text == "unsigned"
> +            and toks[2].kind == "IDENT" and toks[2].text == "short"
> +            and toks[3].kind == "IDENT" and toks[3].text == "int"):
> +            return True
> +
> +        if (name == "uint" and ntok == 4
> +            and toks[1].kind == "IDENT" and toks[1].text == "unsigned"
> +            and toks[2].kind == "IDENT" and toks[2].text == "int"):
> +            return True
> +
> +        return False
> +
> +def ObsoleteTypedefChecker(reporter, fname):
> +    """Factory: produce an instance of the appropriate
> +       obsolete-typedef checker for FNAME."""
> +
> +    # The obsolete rpc/ and rpcsvc/ headers are allowed to use the
> +    # obsolete types, because it would be more trouble than it's
> +    # worth to remove them from headers that we intend to stop
> +    # installing eventually anyway.
> +    if (fname.startswith("rpc/")
> +        or fname.startswith("rpcsvc/")
> +        or "/rpc/" in fname
> +        or "/rpcsvc/" in fname):
> +        return NoCheck(reporter)
> +
> +    # bits/types.h is allowed to define the __-versions of the
> +    # obsolete types.
> +    if (fname == "bits/types.h"
> +        or fname.endswith("/bits/types.h")):
> +        return ObsoletePrivateDefinitionsAllowed(reporter)
> +
> +    # sys/types.h is allowed to use the __-versions of the
> +    # obsolete types, but only to define the unprefixed versions.
> +    if (fname == "sys/types.h"
> +        or fname.endswith("/sys/types.h")):
> +        return ObsoletePublicDefinitionsAllowed(reporter)
> +
> +    return ObsoleteNotAllowed(reporter)
> +
> +#
> +# Master control
> +#
> +
> +class HeaderChecker:
> +    """Perform all of the checks on each header.  This is also the
> +       "reporter" object expected by tokenize_c and ConstructChecker.
> +    """
> +    def __init__(self):
> +        self.fname = None
> +        self.status = 0
> +
> +    def error(self, tok, message):
> +        self.status = 1
> +        if '{!r}' in message:
> +            message = message.format(tok.text)
> +        sys.stderr.write("{}:{}:{}: error: {}\n".format(
> +            self.fname, tok.line, tok.column, message))
> +
> +    def check(self, fname):
> +        self.fname = fname
> +        try:
> +            with open(fname, "rt") as fp:
> +                contents = fp.read()
> +        except OSError as e:
> +            sys.stderr.write("{}: {}\n".format(fname, e.strerror))
> +            self.status = 1
> +            return
> +
> +        typedef_checker = ObsoleteTypedefChecker(self, self.fname)
> +
> +        for tok in tokenize_c(contents, self):
> +            typedef_checker.examine(tok)
> +
> +def main():
> +    ap = argparse.ArgumentParser(description=__doc__)
> +    ap.add_argument("headers", metavar="header", nargs="+",
> +                    help="one or more headers to scan for obsolete constructs")
> +    args = ap.parse_args()
> +
> +    checker = HeaderChecker()
> +    for fname in args.headers:
> +        # Headers whose installed name begins with "finclude/" contain
> +        # Fortran, not C, and this program should completely ignore them.
> +        if not (fname.startswith("finclude/") or "/finclude/" in fname):
> +            checker.check(fname)
> +    sys.exit(checker.status)
> +
> +main()
> 

OK.

-- 
Cheers,
Carlos.
