Date: Sat, 28 Dec 2019 15:14:38 +0200
From: ag
To: bug-gnulib@gnu.org, Bruno Haible, Paul Eggert
Cc: Tim Rühsen
Subject: Re: string types
Message-ID: <20191228131438.GA797@HATZ>
References: <175192568.e2XXTFFdkW@omega> <20191226221225.GA800@HATZ> <2179574.G9OhZXe8sF@omega>
In-Reply-To: <2179574.G9OhZXe8sF@omega>
List-Id: Gnulib discussion list

Hi,

On Fri, Dec 27, at 11:51 Bruno Haible wrote:
> - providing primitives for string allocation reduces the amount of buffer
>   overflow bugs that otherwise occur in this area. [1]

[1] Re: string allocation
    https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Thanks, I remember this thread, though at the time I couldn't understand some
bits.

> > ag wrote:
> > ... to the actual algorithm (usually conditions that can or can't be met).
>
> That is the idea behind the container types (list, map) in gnulib. However, I don't
> see how to reasonably transpose this principle to string types.

Ok, let us try, so allow me to summarize with some of my (unqualified)
assumptions (please correct):

  - glibc malloc can request at most PTRDIFF_MAX
  - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
    (PTRDIFF_MAX is INT_MAX in 32bit)
  - SIZE_MAX as (size_t) (-1)
  - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX equals
    SIZE_MAX?
  - the returned value of the *printf family of functions dictates their
    limits/range: as they return an int, this can mostly be INT_MAX

Some concerns:

  - truncation errors should be caught
  - memory checkers should catch overflows
  - since there is a "risk"¹ that someone has to take at some point (either
    the programmer or the underlying library code (as strdup() does)), the
    designed interface should lower those risks

There is a proposal from Eric Sanchis to the Austin Group, dated 9 Jun 2016,
for a string copy/concatenation interface, in which his functions take both
the allocated size and the number of bytes to be written as arguments (I will
inline some of it here, since I was unable to find his mail in the POSIX
mailing list archives).

I used this as a basis (as it was rather intuitive and perfectly suited for
C) to implement my own str_cp, which goes like this:

  size_t byte_cp (char *dest, const char *src, size_t nelem) {
    const char *sp = src;
    size_t len = 0;

    while (len < nelem && *sp) {
      dest[len] = *sp++;
      len++;
    }

    return len;
  }

  size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
    if (dest_len == 0) return 0;   /* no room even for the terminator */
    size_t num = (nelem > (dest_len - 1) ? dest_len - 1 : nelem);
    size_t len = (NULL == src ? 0 : byte_cp (dest, src, num));
    dest[len] = '\0';
    return len;
  }

Of course it can be done better, but here we have a low level function
(byte_cp) that does only the required checks and returns the actual bytes
written to `dest', while str_cp checks if `src' is NULL and if `nelem' is
bigger than `dest_len' - 1 (if it is, it copies at most `dest_len' - 1
bytes). It returns 0 or the actual written bytes.
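
As a rough sketch of how a caller could detect truncation from that return
value: the block below repeats the two functions so that it compiles by
itself, and adds a helper, str_cp_truncated, which is a hypothetical name of
mine and not part of Eric's proposal or of gnulib. It assumes `src' is
NUL-terminated and is the same pointer that was passed to str_cp.

```c
#include <stddef.h>

/* Copy at most nelem bytes from src, stopping at its terminator.
   Returns the number of bytes written; dest is not terminated here. */
size_t byte_cp (char *dest, const char *src, size_t nelem) {
  size_t len = 0;

  while (len < nelem && src[len] != '\0') {
    dest[len] = src[len];
    len++;
  }

  return len;
}

/* Bounded copy: writes at most dest_len - 1 bytes plus a terminator. */
size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
  if (dest == NULL || dest_len == 0) return 0;
  size_t num = (nelem > dest_len - 1 ? dest_len - 1 : nelem);
  size_t len = (src == NULL ? 0 : byte_cp (dest, src, num));
  dest[len] = '\0';
  return len;
}

/* Hypothetical helper: nonzero when fewer than nelem bytes were written
   although src had not ended yet, i.e. dest was too small. */
int str_cp_truncated (const char *src, size_t nelem, size_t written) {
  return src != NULL && written < nelem && src[written] != '\0';
}
```

With a 4 byte destination, copying 5 bytes of "hello" writes "hel" and the
helper reports truncation; copying "hi" with nelem 3 writes both bytes and it
does not.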

Since this returns the actual bytes written, it is up to the programmer to
check if truncation happened, but there is no possibility to copy more than
`dest_len' - 1.

Based on the above assumptions this can be extended. First, return ssize_t
instead of size_t, so functions can return -1 and set errno accordingly.

Eric Sanchis in his proposal does it a bit differently: his functions take an
extra size_t argument that controls the behavior of the function (what it
will do in the case that the destination length is less than the source
length). He uses an int as the returned value, which on a successful
operation is one of the following:

  #define OKNOTRUNC  0   /* copy/concatenation performed without truncation */
  #define OKTRUNC    1   /* copy/concatenation performed with truncation    */

And below is the extra information passed as the fifth argument:

  #define TRUNC      0   /* truncation allowed     */
  #define NOTRUNC    1   /* truncation not allowed */

In the case of an error, it returns < 0, which is one of:

  #define EDSTPAR   -1   /* Error : bad dst parameters */
  #define ESRCPAR   -2   /* Error : bad src parameters */
  #define EMODPAR   -3   /* Error : bad mode parameter */
  #define ETRUNC    -4   /* Error : not enough space to copy/concatenate
                                    and truncation not allowed */

Now combining all this, and if the assumptions are correct, gnulib can return
ssize_t and use this to make its functions work up to SIZE_MAX, and either
use Eric's interface or set errno accordingly.

But to me a function call like:

  str_cp (dest, memsize_of_dest, src, memsize_of_dest - 1)

is quite a common C way to do things, plus we have a way to catch truncation
and not to go out of bounds at the same time.

Of course such operations are tied to malloc(). I read the gnulib
documentation yesterday and I saw that gnulib wraps malloc() with a function
that (quite logically) aborts execution and even allows setting a callback
function.

In my humble opinion there is also the choice of reallocarray() from OpenBSD,
which always checks for integer overflow in the following way:

  #define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))

  #define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
    (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&      \
     (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

Now, you also said in the abovementioned thread:

>> So, what we would need is are functions
>>
>>   char * substring (const char *string, size_t length);
>>   char * concatenated_string2 (const char *string1, size_t length1,
>>                                const char *string2, size_t length2);
>>   char * concatenated_string3 (const char *string1, size_t length1,
>>                                const char *string2, size_t length2,
>>                                const char *string3, size_t length3);
>>   ...
>>
>> where the length arguments are set to SIZE_MAX to designate the entire string.

But this is exactly why a string_buffer is preferred on many occasions like
these, plus it also has constant time access to the byte length.

> > An extended ustring (unicode|utf8) type can include information for its bytes with
> > character semantics, like:
> >   (utf8 typedef'ed as signed int)
> >   utf8 code;   // the integer representation
> >   int len;     // the number of the needed bytes
> >   int width;   // the number of the occupied cells
> >   char buf[5]; // and probably the character representation
>
> Such a type would have a niche use, IMO, because
>   - 99% of the processing would not need to access the width (screen columns) - so
>     why spend CPU time and RAM to store it and keep it up-to-date?
>   - 80% of the processing does not care about the Unicode code points either,
>     and libraries like libunistring can do the Unicode-aware processing.

Of course it is specialized, but those functions/operations are not uncommon,
as many need this information. And I also forgot to include utf8 validation.
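
To make the idea a bit more concrete, here is a minimal sketch of how such a
record could be filled in, with validation, for one character. The names
ustring_char and ustring_char_decode are hypothetical, utf8 is assumed to be
a 32-bit signed integer, and `width' is simply left at -1, because computing
it would need something like wcwidth() and a locale.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t utf8;       /* assumption: 32-bit signed code point */

typedef struct {
  utf8 code;                /* the integer representation */
  int  len;                 /* the number of the needed bytes (1..4) */
  int  width;               /* occupied cells; not computed here (-1) */
  char buf[5];              /* NUL-terminated copy of the sequence */
} ustring_char;

/* Decode and validate one UTF-8 sequence from s (n bytes available).
   Returns the number of bytes consumed, or -1 if the input is invalid. */
int ustring_char_decode (ustring_char *out, const char *s, size_t n) {
  if (out == NULL || s == NULL || n == 0) return -1;

  const unsigned char *p = (const unsigned char *) s;
  int len;
  utf8 code, min;           /* min: smallest code point for this length */

  if (p[0] < 0x80)                { len = 1; code = p[0];        min = 0; }
  else if ((p[0] & 0xE0) == 0xC0) { len = 2; code = p[0] & 0x1F; min = 0x80; }
  else if ((p[0] & 0xF0) == 0xE0) { len = 3; code = p[0] & 0x0F; min = 0x800; }
  else if ((p[0] & 0xF8) == 0xF0) { len = 4; code = p[0] & 0x07; min = 0x10000; }
  else return -1;           /* stray continuation or invalid lead byte */

  if ((size_t) len > n) return -1;            /* sequence cut short */

  for (int i = 1; i < len; i++) {
    if ((p[i] & 0xC0) != 0x80) return -1;     /* not a continuation byte */
    code = (code << 6) | (p[i] & 0x3F);
  }

  if (code < min) return -1;                        /* overlong encoding */
  if (code > 0x10FFFF) return -1;                   /* beyond Unicode */
  if (code >= 0xD800 && code <= 0xDFFF) return -1;  /* surrogate */

  out->code = code;
  out->len = len;
  out->width = -1;
  for (int i = 0; i < len; i++) out->buf[i] = s[i];
  out->buf[len] = '\0';
  return len;
}
```

A caller would walk a byte string by advancing by the returned length and
stopping at the first -1.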

In that case, as unfortunately there is no way in C to exclude or include
fields in a structure, and since I'm talking here mostly about the
functionality rather than a specific type, and since you mentioned
libunistring, perhaps it would be wise to offer this functionality in gnulib
(like you do for iconv and readline).

But really the level of abstraction that maybe will help C developers is
mostly something (very simplified) like this:

  inline long fget_size (FILE *fp);

implemented (probably) as:

  long cur_p = ftell (fp);
  fseek (fp, 0, SEEK_END);
  long size = ftell (fp);
  fseek (fp, cur_p, SEEK_SET);
  return size;

There is no penalty here, it will just be a common and expected way to do
things. Maybe then writing and reading code in C will be much more enjoyable
and C can be considered an expressive language.

But all this needs a standard. Perhaps gnulib can lead on this.

> Bruno

Best,
  Αγαθοκλής

¹. https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00004.html