From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 28 Dec 2019 15:14:38 +0200
From: ag <aga.chatzimanikas@gmail.com>
To: bug-gnulib@gnu.org, Bruno Haible <bruno@clisp.org>,
 Paul Eggert <eggert@cs.ucla.edu>
Subject: Re: string types
Message-ID: <20191228131438.GA797@HATZ>
References: <175192568.e2XXTFFdkW@omega>
 <ab16546a-1318-331e-832b-656fa5a78a1e@gmx.de>
 <20191226221225.GA800@HATZ> <2179574.G9OhZXe8sF@omega>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <2179574.G9OhZXe8sF@omega>
User-Agent: Mutt/1.12.1 (2019-06-15)
List-Id: Gnulib discussion list <bug-gnulib.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnulib>,
 <mailto:bug-gnulib-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnulib>
List-Post: <mailto:bug-gnulib@gnu.org>
List-Help: <mailto:bug-gnulib-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnulib>,
 <mailto:bug-gnulib-request@gnu.org?subject=subscribe>
Cc: Tim =?utf-8?Q?R=C3=BChsen?= <tim.ruehsen@gmx.de>

Hi,

On Fri, Dec 27, at 11:51 Bruno Haible wrote:
>  - providing primitives for string allocation reduces the amount of buffer
>    overflow bugs that otherwise occur in this area. [1]

[1] Re: string allocation
https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Thanks, I remember this thread, though at the time I couldn't understand some bits.

>> ag wrote:
> > ... to the actual algorithm (usually conditions that can or can't be met).

> That is the idea behind the container types (list, map) in gnulib. However, I don't
> see how to reasonably transpose this principle to string types.

OK, let us try. Allow me to summarize some of my (unqualified) assumptions
(please correct them):

  - glibc malloc can allocate at most PTRDIFF_MAX bytes

  - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
    (PTRDIFF_MAX is INT_MAX on 32-bit platforms)

  - SIZE_MAX is (size_t) (-1)

  - ssize_t (s means signed?) can be as big as SIZE_MAX? And is SSIZE_MAX equal to
    SIZE_MAX?

  - the return value of the *printf family of functions dictates their
    limits/range: as they return an int, the result can be at most INT_MAX
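
As a quick sanity check of the assumptions above, a sketch like the following
could be compiled on the platform in question. The function names are made up
for illustration. Only the first check is guaranteed by the C standard; the
other two merely report what the platform does. Note that ssize_t is a signed
type, so when it has the same width as size_t, SSIZE_MAX cannot reach SIZE_MAX
(on common platforms it is SIZE_MAX / 2).

```c
#include <limits.h>   /* INT_MAX, and SSIZE_MAX on POSIX systems */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* SIZE_MAX, PTRDIFF_MAX, uintmax_t */

/* Always 1: size_t is unsigned, so converting -1 yields its maximum value. */
int size_max_is_minus_one (void) {
  return SIZE_MAX == (size_t) -1;
}

/* 1 when PTRDIFF_MAX lies between INT_MAX and SIZE_MAX on this platform. */
int ptrdiff_max_in_expected_range (void) {
  return PTRDIFF_MAX >= INT_MAX
         && (uintmax_t) PTRDIFF_MAX <= (uintmax_t) SIZE_MAX;
}

/* 1 when SSIZE_MAX is strictly below SIZE_MAX, 0 when it is not,
   -1 when <limits.h> does not expose SSIZE_MAX at all. */
int ssize_max_below_size_max (void) {
#ifdef SSIZE_MAX
  return (uintmax_t) SSIZE_MAX < (uintmax_t) SIZE_MAX;
#else
  return -1;
#endif
}
```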

Some concerns:

  - truncation errors should be caught

  - memory checkers should catch overflows

  - since there is a "risk"¹ that someone has to take at some point (either the
    programmer or the underlying library code, as strdup() does), the designed
    interface should lower those risks

There is a proposal from Eric Sanchis to the Austin Group, dated 9 Jun 2016, for a
string copy/concatenation interface in which the functions take both the allocated
size and the number of bytes to be written as arguments (I will inline some of it
here, since I was unable to find his mail in the POSIX mailing list archives).

I used this as a basis (as it was rather intuitive and perfectly suited for C), to
implement my own str_cp, which goes like this:

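#include <stddef.h>

size_t byte_cp (char *dest, const char *src, size_t nelem);

/* Copy at most nelem bytes of src into dest (which holds dest_len bytes),
   always NUL-terminate, and return the number of bytes written. */
size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
  if (NULL == dest || 0 == dest_len) return 0; /* avoids dest_len - 1 wrapping */
  size_t num = (nelem > (dest_len - 1) ? dest_len - 1 : nelem);
  size_t len = (NULL == src ? 0 : byte_cp (dest, src, num));
  dest[len] = '\0';
  return len;
}

/* Copy bytes until nelem is reached or src ends; writes no terminator. */
size_t byte_cp (char *dest, const char *src, size_t nelem) {
  const char *sp = src;
  size_t len = 0;

  while (len < nelem && *sp) {
    dest[len] = *sp++;
    len++;
  }

  return len;
}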

Of course it can be done better, but here we have a low-level function (byte_cp)
that does only the required checks and returns the actual number of bytes written to
`dest', while str_cp checks if `src' is NULL and if `nelem' is bigger than
`dest_len' - 1 (if it is, it copies at most `dest_len' - 1 bytes). It returns 0 or
the actual number of bytes written.

Since this returns the actual bytes written, it is up to the programmer to check
if truncation happened, but there is no possibility to copy more than `dest_len' - 1.
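
To make the truncation check concrete, here is a minimal sketch. It repeats the
two functions so that it compiles on its own (with a guard for a zero
`dest_len' added), and str_cp_was_truncated is a hypothetical helper invented
for this example: it asks for the whole source string and compares the result
with strlen (src).

```c
#include <stddef.h>
#include <string.h>

/* Copy until nelem bytes are written or src ends; no terminator is written. */
size_t byte_cp (char *dest, const char *src, size_t nelem) {
  size_t len = 0;
  while (len < nelem && src[len] != '\0') {
    dest[len] = src[len];
    len++;
  }
  return len;
}

size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
  if (dest == NULL || dest_len == 0) return 0;  /* added guard */
  size_t num = (nelem > dest_len - 1 ? dest_len - 1 : nelem);
  size_t len = (src == NULL ? 0 : byte_cp (dest, src, num));
  dest[len] = '\0';
  return len;
}

/* Hypothetical helper: copy all of src, return 1 if it did not fit. */
int str_cp_was_truncated (char *dest, size_t dest_len, const char *src) {
  size_t want = strlen (src);
  size_t got = str_cp (dest, dest_len, src, want);
  return got < want;
}
```

With a 4-byte destination and the source "hello", only "hel" is stored and the
helper reports truncation; the destination is never written past its end.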

Based on the above assumptions this can be extended. First, return ssize_t instead
of size_t, so that functions can return -1 and set errno accordingly.
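
A rough sketch of what such an ssize_t variant might look like follows. The
name str_cp_e, the decision to refuse truncation entirely, and the choice of
EINVAL and ERANGE as error codes are all assumptions made for illustration,
not part of any existing interface.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Hypothetical variant: returns the number of bytes copied (excluding the
   terminator), or -1 with errno set when the arguments are bad or when src
   does not fit into dest. */
ssize_t str_cp_e (char *dest, size_t dest_len, const char *src) {
  if (dest == NULL || src == NULL || dest_len == 0) {
    errno = EINVAL;
    return -1;
  }

  size_t src_len = strlen (src);
  if (src_len > dest_len - 1) {
    dest[0] = '\0';      /* leave a valid empty string behind */
    errno = ERANGE;      /* assumption: ERANGE signals "does not fit" */
    return -1;
  }

  memcpy (dest, src, src_len + 1);   /* includes the terminating NUL */
  return (ssize_t) src_len;
}
```

The price of the signed return type is that the usable range is halved: a
successful result can be at most SSIZE_MAX, not SIZE_MAX.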

Eric Sanchis does it a bit differently in his proposal: his functions take an extra
size_t argument that controls the behavior of the function (what it does when the
destination length is less than the source length).

He uses an int as the return value, which on a successful operation is either 0 or
1, as follows:
#define   OKNOTRUNC  0		/* copy/concatenation performed without truncation */
#define   OKTRUNC    1		/* copy/concatenation performed with truncation */

And below is the extra information passed as the fifth argument:
#define   TRUNC      0		/* truncation allowed */
#define   NOTRUNC    1		/* truncation not allowed */

In the case of an error, it returns a value < 0, which is one of:
#define   EDSTPAR   -1		/* Error : bad dst parameters */
#define   ESRCPAR   -2		/* Error : bad src parameters */
#define   EMODPAR   -3		/* Error : bad mode parameter */
#define   ETRUNC    -4		/* Error : not enough space to copy/concatenate
							   and truncation not allowed */
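
To show how these constants could fit together, here is a sketch of a copy
function in this style. It is not taken from the proposal: the name
str_cp_mode, the argument order, and the decision to leave `dst' untouched on
ETRUNC are my own assumptions. Only the constants come from the text above.

```c
#include <stddef.h>
#include <string.h>

#define OKNOTRUNC  0   /* copy performed without truncation */
#define OKTRUNC    1   /* copy performed with truncation */
#define TRUNC      0   /* truncation allowed */
#define NOTRUNC    1   /* truncation not allowed */
#define EDSTPAR   -1   /* bad dst parameters */
#define ESRCPAR   -2   /* bad src parameters */
#define EMODPAR   -3   /* bad mode parameter */
#define ETRUNC    -4   /* not enough space and truncation not allowed */

/* Hypothetical function: copy up to nbytes of src into dst (dst_size bytes). */
int str_cp_mode (char *dst, size_t dst_size, const char *src, size_t nbytes,
                 size_t mode) {
  if (dst == NULL || dst_size == 0) return EDSTPAR;
  if (src == NULL) return ESRCPAR;
  if (mode != TRUNC && mode != NOTRUNC) return EMODPAR;

  size_t want = 0;   /* bytes of src we are actually asked to copy */
  while (want < nbytes && src[want] != '\0') want++;

  if (want > dst_size - 1) {
    if (mode == NOTRUNC) return ETRUNC;   /* dst is left untouched */
    memcpy (dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
    return OKTRUNC;
  }

  memcpy (dst, src, want);
  dst[want] = '\0';
  return OKNOTRUNC;
}
```

The caller decides up front whether truncation is acceptable, and every outcome
maps to a distinct return value, so no separate length comparison is needed.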

Now, combining all this, and if the assumptions are correct, gnulib can return
ssize_t and use this to make its functions work up to SIZE_MAX, and either use
Eric's interface or set errno accordingly.

But to me a function call like:
  str_cp (dest, memsize_of_dest, src, memsize_of_dest - 1)
is quite a common C way of doing things, plus we have a way to catch truncation and
to stay within bounds at the same time.

Of course such operations are tied to malloc().
I read the gnulib documentation yesterday and I saw that gnulib wraps malloc() with
a function that (quite logically) aborts execution and even allows setting a
callback function.

In my humble opinion there is also the option of using reallocarray() from OpenBSD,
which always checks for integer overflow in the following way:

#define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
#define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
 (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&       \
  (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))
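
For illustration, here is how the macro above might guard an allocation. The
wrapper name alloc_array is hypothetical. The idea behind MUL_NO_OVERFLOW is
that if both operands are below 2^(bits/2) their product cannot overflow, so
the (slower) division is only evaluated when at least one operand is large.
Setting errno to ENOMEM mirrors what reallocarray() does on overflow.

```c
#include <errno.h>
#include <stdint.h>   /* SIZE_MAX */
#include <stdlib.h>   /* malloc */

#define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
#define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
 (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&       \
  (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

/* Hypothetical wrapper: allocate nmemb * size bytes, refusing requests
   whose byte count would wrap around. */
void *alloc_array (size_t nmemb, size_t size) {
  if (MEM_IS_INT_OVERFLOW (nmemb, size)) {
    errno = ENOMEM;
    return NULL;
  }
  return malloc (nmemb * size);
}
```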


Now, you also said in the above-mentioned thread:

>> So, what we would need is are functions

    char * substring (const char *string, size_t length);
    char * concatenated_string2 (const char *string1, size_t length1,
                                 const char *string2, size_t length2);
    char * concatenated_string3 (const char *string1, size_t length1,
                                 const char *string2, size_t length2,
                                 const char *string3, size_t length3);
    ...

>> where the length arguments are set to SIZE_MAX to designate the entire
>> string.

But this is exactly why a string_buffer is preferred on many occasions like these;
it also has constant-time access to the byte length.
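
To illustrate what I mean, here is a minimal sketch of such a buffer. The type,
its field names, and string_buffer_append are invented for this example and do
not refer to an existing gnulib interface. Concatenating two or three strings
becomes a chain of appends, and the length is read from a field instead of
being recomputed with strlen().

```c
#include <stddef.h>
#include <stdint.h>   /* SIZE_MAX */
#include <stdlib.h>   /* realloc */
#include <string.h>   /* memcpy */

/* Hypothetical string buffer: the byte length is stored, not recomputed. */
typedef struct {
  char  *bytes;       /* NUL-terminated contents, or NULL when empty */
  size_t num_bytes;   /* length, excluding the terminating NUL */
  size_t mem_size;    /* allocated size of bytes */
} string_buffer;

/* Append len bytes of src. Returns 0 on success, -1 if the new size would
   overflow size_t or if the allocation fails (sb is then unchanged). */
int string_buffer_append (string_buffer *sb, const char *src, size_t len) {
  if (len >= SIZE_MAX - sb->num_bytes) return -1;

  size_t need = sb->num_bytes + len + 1;
  if (need > sb->mem_size) {
    size_t new_size = need;
    /* Prefer doubling when that is enough and cannot wrap around. */
    if (sb->mem_size <= SIZE_MAX / 2 && sb->mem_size * 2 > need)
      new_size = sb->mem_size * 2;
    char *p = realloc (sb->bytes, new_size);
    if (p == NULL) return -1;
    sb->bytes = p;
    sb->mem_size = new_size;
  }

  memcpy (sb->bytes + sb->num_bytes, src, len);
  sb->num_bytes += len;
  sb->bytes[sb->num_bytes] = '\0';
  return 0;
}
```

A buffer starts out as `string_buffer sb = { NULL, 0, 0 };' and is released with
free (sb.bytes).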

> > An extended ustring (unicode|utf8) type can include information for its bytes with
> > character semantics, like:
> >  (utf8 typedef'ed as signed int)
> >   utf8 code;   // the integer representation
> >   int len;     // the number of the needed bytes
> >   int width;   // the number of the occupied cells
> >   char buf[5]; // and probably the character representation
>
> Such a type would have a niche use, IMO, because
>   - 99% of the processing would not need to access the width (screen columns) - so
>     why spend CPU time and RAM to store it and keep it up-to-date?
>   - 80% of the processing does not care about the Unicode code points either,
>     and libraries like libunistring can do the Unicode-aware processing.

Of course it is specialized, but those functions/operations are not uncommon, as
many need this information. And I also forgot to include utf8 validation.
In that case, as unfortunately there is no way in C to exclude or include fields
in a structure, and since I'm talking here mostly about the functionality rather
than a specific type, and since you mentioned libunistring, perhaps it would be
wise to offer this functionality in gnulib (like you do for iconv and readline).
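
As an illustration of the kind of functionality meant here, the sketch below
decodes and validates a single UTF-8 sequence and fills in the code point, the
byte count, and the byte representation. The names utf8_char and utf8_decode
are hypothetical, and the `width' field is left out on purpose, since computing
it needs locale or Unicode data (e.g. wcwidth() or libunistring).

```c
#include <stddef.h>

typedef signed int utf8;

/* Hypothetical per-character record, modelled on the fields quoted above. */
typedef struct {
  utf8 code;     /* the integer representation (code point) */
  int  len;      /* the number of bytes in the encoding */
  char buf[5];   /* the encoded bytes, NUL-terminated */
} utf8_char;

/* Decode one sequence from s, where n bytes are available.
   Returns 0 on success, -1 if the sequence is truncated or invalid. */
int utf8_decode (const char *s, size_t n, utf8_char *out) {
  const unsigned char *p = (const unsigned char *) s;
  int len;
  utf8 code, min;

  if (n == 0) return -1;
  if (p[0] < 0x80)                { len = 1; code = p[0];        min = 0; }
  else if ((p[0] & 0xE0) == 0xC0) { len = 2; code = p[0] & 0x1F; min = 0x80; }
  else if ((p[0] & 0xF0) == 0xE0) { len = 3; code = p[0] & 0x0F; min = 0x800; }
  else if ((p[0] & 0xF8) == 0xF0) { len = 4; code = p[0] & 0x07; min = 0x10000; }
  else return -1;                 /* stray continuation or invalid lead byte */

  if ((size_t) len > n) return -1;
  for (int i = 1; i < len; i++) {
    if ((p[i] & 0xC0) != 0x80) return -1;   /* not a continuation byte */
    code = (code << 6) | (p[i] & 0x3F);
  }

  /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
  if (code < min || code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
    return -1;

  out->code = code;
  out->len = len;
  for (int i = 0; i < len; i++) out->buf[i] = s[i];
  out->buf[len] = '\0';
  return 0;
}
```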

But really, the level of abstraction that might help C developers is mostly
something (very simplified) like this:

inline long fget_size (FILE *);

implemented (probably) as:

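long fget_size (FILE *fp) {
  long cur_p = ftell (fp);      /* remember the current position */
  fseek (fp, 0, SEEK_END);
  long size = ftell (fp);       /* offset of the end == size in bytes */
  fseek (fp, cur_p, SEEK_SET);  /* restore the position */
  return size;
}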
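
A slightly more defensive sketch of the same idea is shown below, with the
return values of ftell() and fseek() checked. Some caveats: this only works on
seekable streams, the C standard does not require binary streams to support
SEEK_END meaningfully, and `long' limits the result to 2 GiB on platforms where
long is 32 bits (POSIX offers fseeko()/ftello() or fstat() for that case).

```c
#include <stdio.h>

/* Return the size in bytes of a seekable stream, or -1 on error.
   The current file position is restored before returning. */
long fget_size (FILE *fp) {
  long cur_p = ftell (fp);
  if (cur_p < 0) return -1;

  if (fseek (fp, 0, SEEK_END) != 0) return -1;
  long size = ftell (fp);

  if (fseek (fp, cur_p, SEEK_SET) != 0) return -1;
  return size;
}
```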

There is no penalty here; it will just be a common and expected way to do things.
Maybe then writing and reading code in C will be much more enjoyable and C can be
considered an expressive language.

But all this needs a standard. Perhaps gnulib can lead the way.

> Bruno

Best,
  Αγαθοκλής

¹. https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00004.html