Subject: Re: dfa.c no longer usable if no 64-bit support
To: Bruno Haible, bug-gnulib@gnu.org
Cc: arnold@skeeve.com
References: <202001291418.00TEIGrO030551@freefriends.org> <33627708.OG805fn80Z@omega>
In-Reply-To: <33627708.OG805fn80Z@omega>
From: Paul Eggert
Organization: UCLA Computer Science Department
Date: Wed, 29 Jan 2020 11:16:37 -0800
List-Id: Gnulib discussion list
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------E654CBBF2381ED88E6453CE8"
Content-Language: en-US

This is a multi-part message in MIME format.
--------------E654CBBF2381ED88E6453CE8
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

On 1/29/20 7:34 AM, Bruno Haible wrote:
> I would say that it's not worth the effort - except for the person(s)
> who care a lot about Vax/VMS.

Normally I'd agree, but if Arnold cares about VAX/VMS and if we want Gnulib
dfa.c to match Gawk dfa.c, then in this particular case it makes some sense to
support 32-bit-only platforms, as it's easy to revert the recent patch that made
dfa.c assume 64-bit. So I installed the attached.
However, I see some other points of departure for Gawk dfa.c:

* Gawk dfa.c/dfa.h does not use flexible array members or the
portable-to-7th-edition-Unix substitute provided by Gnulib, so I suggest that
Gawk import Gnulib lib/flexmember.h, and either "#define FLEXIBLE_ARRAY_MEMBER
1" in config.h or (better) import Gnulib m4/flexmember.m4.

* Gawk dfa.c doesn't use isblank, but instead defines its own is_blank that is
hard-coded to the C locale. Isn't [[:blank:]] supposed to be locale-dependent?
Or are you assuming that space and tab are the only blank characters in all
single-byte locales?

* Gawk dfa.c includes mbsupport.h if __DJGPP__ is defined. I suggest moving this
to Gawk config.h so that dfa.c need not worry about it.

* Gawk dfa.c replaces "#include <stdint.h>" with:

   #ifndef VMS
   #include <stdint.h>
   #else
   #define SIZE_MAX __INT32_MAX
   #define PTRDIFF_MAX __INT32_MAX
   #endif

I suppose we could add something like this to Gnulib dfa.c but it's a bit ugly;
is there a cleaner way to do it? Perhaps Gawk could supply its own little
substitute stdint.h on VMS. (Gnulib does this too but I assume Gnulib's stdint.h
is too heavyweight for Gawk.)

--------------E654CBBF2381ED88E6453CE8
Content-Type: text/x-patch; charset=UTF-8;
 name="0001-dfa-do-not-assume-64-bit-int.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="0001-dfa-do-not-assume-64-bit-int.patch"

>From 335bfddb5ea0e6138a026ae723ea1e0ee2a2cd90 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Wed, 29 Jan 2020 10:58:26 -0800
Subject: [PATCH] dfa: do not assume 64-bit int

Problem reported for VAX/VMS C (!) by Arnold Robbins in:
https://lists.gnu.org/r/bug-gnulib/2020-01/msg00173.html
* lib/dfa.c (CHARCLASS_PAIR): Bring back this macro.
(CHARCLASS_WORD_BITS, charclass_word) [!UINT_LEAST64_MAX]:
Fall back to 32-bit words.
(CHARCLASS_INIT): Go back to having 8 32-bit args instead
of 4 64-bit args. All uses changed.
---
 ChangeLog | 11 +++++++++++
 lib/dfa.c | 40 +++++++++++++++++++++++++++++-----------
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index a861f4996..2e64116c1 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,14 @@
+2020-01-29  Paul Eggert
+
+	dfa: do not assume 64-bit int
+	Problem reported for VAX/VMS C (!) by Arnold Robbins in:
+	https://lists.gnu.org/r/bug-gnulib/2020-01/msg00173.html
+	* lib/dfa.c (CHARCLASS_PAIR): Bring back this macro.
+	(CHARCLASS_WORD_BITS, charclass_word) [!UINT_LEAST64_MAX]:
+	Fall back to 32-bit words.
+	(CHARCLASS_INIT): Go back to having 8 32-bit args instead
+	of 4 64-bit args. All uses changed.
+
 2020-01-27  Paul Eggert
 
 	regex: remove limits-h dependency
diff --git a/lib/dfa.c b/lib/dfa.c
index 96ae560b1..4e9478394 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -84,6 +84,8 @@ isasciidigit (char c)
 /* First integer value that is greater than any character code. */
 enum { NOTCHAR = 1 << CHAR_BIT };
 
+#ifdef UINT_LEAST64_MAX
+
 /* Number of bits used in a charclass word. */
 enum { CHARCLASS_WORD_BITS = 64 };
 
@@ -91,8 +93,24 @@ enum { CHARCLASS_WORD_BITS = 64 };
    at least CHARCLASS_WORD_BITS wide. Any excess bits are zero. */
 typedef uint_least64_t charclass_word;
 
-/* An initializer for a charclass whose 64-bit words are A through D. */
-#define CHARCLASS_INIT(a, b, c, d) {{a, b, c, d}}
+/* Part of a charclass initializer that represents 64 bits' worth of a
+   charclass, where LO and HI are the low and high-order 32 bits of
+   the 64-bit quantity. */
+# define CHARCLASS_PAIR(lo, hi) (((charclass_word) (hi) << 32) + (lo))
+
+#else
+/* Fallbacks for pre-C99 hosts that lack 64-bit integers. */
+enum { CHARCLASS_WORD_BITS = 32 };
+typedef unsigned long charclass_word;
+# define CHARCLASS_PAIR(lo, hi) lo, hi
+#endif
+
+/* An initializer for a charclass whose 32-bit words are A through H. */
+#define CHARCLASS_INIT(a, b, c, d, e, f, g, h) \
+   {{ \
+      CHARCLASS_PAIR (a, b), CHARCLASS_PAIR (c, d), \
+      CHARCLASS_PAIR (e, f), CHARCLASS_PAIR (g, h) \
+   }}
 
 /* The maximum useful value of a charclass_word; all used bits are 1. */
 static charclass_word const CHARCLASS_WORD_MASK
@@ -1699,39 +1717,39 @@ add_utf8_anychar (struct dfa *dfa)
 
   static charclass const utf8_classes[] = {
     /* A. 00-7f: 1-byte sequence. */
-    CHARCLASS_INIT (0xffffffffffffffff, 0xffffffffffffffff, 0, 0),
+    CHARCLASS_INIT (0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0, 0, 0, 0),
 
     /* B. c2-df: 1st byte of a 2-byte sequence. */
-    CHARCLASS_INIT (0, 0, 0, 0x00000000fffffffc),
+    CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0xfffffffc, 0),
 
     /* C. 80-bf: non-leading bytes. */
-    CHARCLASS_INIT (0, 0, 0xffffffffffffffff, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0xffffffff, 0xffffffff, 0, 0),
 
     /* D. e0 (just a token). */
 
     /* E. a0-bf: 2nd byte of a "DEC" sequence. */
-    CHARCLASS_INIT (0, 0, 0xffffffff00000000, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0, 0xffffffff, 0, 0),
 
     /* F. e1-ec + ee-ef: 1st byte of an "FCC" sequence. */
-    CHARCLASS_INIT (0, 0, 0, 0x0000dffe00000000),
+    CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xdffe),
 
     /* G. ed (just a token). */
 
     /* H. 80-9f: 2nd byte of a "GHC" sequence. */
-    CHARCLASS_INIT (0, 0, 0x000000000000ffff, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0xffff, 0, 0, 0),
 
     /* I. f0 (just a token). */
 
     /* J. 90-bf: 2nd byte of an "IJCC" sequence. */
-    CHARCLASS_INIT (0, 0, 0xffffffffffff0000, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0xffff0000, 0xffffffff, 0, 0),
 
     /* K. f1-f3: 1st byte of a "KCCC" sequence. */
-    CHARCLASS_INIT (0, 0, 0, 0x000e000000000000),
+    CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xe0000),
 
     /* L. f4 (just a token). */
 
     /* M. 80-8f: 2nd byte of a "LMCC" sequence. */
-    CHARCLASS_INIT (0, 0, 0x00000000000000ff, 0),
+    CHARCLASS_INIT (0, 0, 0, 0, 0xff, 0, 0, 0),
   };
 
   /* Define the character classes that are needed below. */
-- 
2.24.1

--------------E654CBBF2381ED88E6453CE8--
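
For readers following the patch, here is a minimal, self-contained sketch of the word-size fallback it restores. The CHARCLASS_* macros mirror the patch text. The charclass struct and the tstbit and count_members helpers are simplified stand-ins written for this illustration, and FORCE_32BIT_WORDS is a hypothetical switch (not part of dfa.c) for exercising the 32-bit branch on a host that does have 64-bit integers. The sketch assumes CHAR_BIT is 8.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* First integer value that is greater than any character code. */
enum { NOTCHAR = 1 << CHAR_BIT };

/* FORCE_32BIT_WORDS is a hypothetical knob for this sketch only. */
#if defined UINT_LEAST64_MAX && !defined FORCE_32BIT_WORDS
enum { CHARCLASS_WORD_BITS = 64 };
typedef uint_least64_t charclass_word;
/* Pack two 32-bit halves into one 64-bit initializer. */
# define CHARCLASS_PAIR(lo, hi) (((charclass_word) (hi) << 32) + (lo))
#else
enum { CHARCLASS_WORD_BITS = 32 };
typedef unsigned long charclass_word;
/* Pass the two halves through as two separate 32-bit initializers. */
# define CHARCLASS_PAIR(lo, hi) lo, hi
#endif

/* Number of words needed to hold one bit per character code. */
enum { CHARCLASS_WORDS
         = (NOTCHAR + CHARCLASS_WORD_BITS - 1) / CHARCLASS_WORD_BITS };

/* Simplified stand-in for a character-class bitset. */
typedef struct { charclass_word w[CHARCLASS_WORDS]; } charclass;

/* Always takes eight 32-bit arguments, whatever the word size. */
#define CHARCLASS_INIT(a, b, c, d, e, f, g, h) \
  {{ \
    CHARCLASS_PAIR (a, b), CHARCLASS_PAIR (c, d), \
    CHARCLASS_PAIR (e, f), CHARCLASS_PAIR (g, h) \
  }}

/* Return 1 if byte B is a member of class C, else 0. */
static int
tstbit (unsigned int b, charclass const *c)
{
  assert (b < NOTCHAR);
  return c->w[b / CHARCLASS_WORD_BITS] >> (b % CHARCLASS_WORD_BITS) & 1;
}

/* Count the bytes that are members of class C. */
static int
count_members (charclass const *c)
{
  int n = 0;
  for (unsigned int b = 0; b < NOTCHAR; b++)
    n += tstbit (b, c);
  return n;
}

/* Two initializers copied from the patch's utf8_classes table. */
static charclass const class_b   /* c2-df */
  = CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0xfffffffc, 0);
static charclass const class_k   /* f1-f3 */
  = CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xe0000);
```

Compiling this once as-is and once with -DFORCE_32BIT_WORDS should give the same membership results for both classes, since CHARCLASS_PAIR either packs two 32-bit halves into one 64-bit word or hands them on as two adjacent 32-bit words.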