Re: string types

From: ag <aga.chatzimanikas@gmail.com>
To: bug-gnulib@gnu.org, Bruno Haible <bruno@clisp.org>,
	Paul Eggert <eggert@cs.ucla.edu>
Cc: "Tim Rühsen" <tim.ruehsen@gmx.de>
Subject: Re: string types
Date: Sat, 28 Dec 2019 15:14:38 +0200	[thread overview]
Message-ID: <20191228131438.GA797@HATZ> (raw)
In-Reply-To: <2179574.G9OhZXe8sF@omega>

Hi,

On Fri, Dec 27, at 11:51 Bruno Haible wrote:
>  - providing primitives for string allocation reduces the amount of buffer
>    overflow bugs that otherwise occur in this area. [1]

[1] Re: string allocation
https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html

Thanks, i remember this thread, though at the time i couldn't understand some bits.

>> ag wrote:
> > ... to the actual algorithm (usually conditions that can or can't be met).

> That is the idea behind the container types (list, map) in gnulib. However, I don't
> see how to reasonably transpose this principle to string types.

Ok, let us try, so allow me to summarize with some of (my unqualified) assumptions
(please correct):

  - glibc malloc can request at most PTRDIFF_MAX

  - PTRDIFF_MAX is at least INT_MAX and at most SIZE_MAX
    (PTRDIFF_MAX is INT_MAX in 32bit)

  - SIZE_MAX as (size_t) (-1)

  - ssize_t (s means signed?) can be as big as SIZE_MAX? and SSIZE_MAX equals to
    SIZE_MAX?

  - the returned value of the *printf family of functions dictates their
    limits/range, as they return an int, this can be as INT_MAX mostly

Some concerns:

  - truncation errors should be caught

  - memory checkers should catch overflows

  - as since there is a "risk"¹ that someone has to take at some point (either the
    programmer or the underlying library code (as strdup() does)), the designed
    interface should lower those risks

There is a proposal from Eric Sanchis to Austin group at 9 Jun 2016, for a String
copy/concatenation interface, that his functions have both the allocated size and
the number of bytes to be written as arguments (some i will inline them here, since
i was unable to find his mail in the Posix mailing list archives).

I used this as a basis (as it was rather intuitive and perfectly suited for C), to
implement my own str_cp, which goes like this:

size_t str_cp (char *dest, size_t dest_len, const char *src, size_t nelem) {
  size_t num = (nelem > (dest_len - 1) ? dest_len - 1 : nelem);
  size_t len = (NULL is src ? 0 : byte_cp (dest, src, num));
  dest[len] = '\0';
  return len;
}

size_t byte_cp (char *dest, const char *src, size_t nelem) {
  const char *sp = src;
  size_t len = 0;

  while (len < nelem and *sp) {
    dest[len] = *sp++;
    len++;
  }

  return len;
}

Of course it can be done better, but here we have a low level function (byte_cp),
that does only the required checks and which returns the actual bytes written to
`dest', while str_cp checks if `src' is NULL and if `nelem' is bigger than `dest_len'
(if it is then copies at least `dest_len' - 1). It returns 0 or the actual written
bytes.

Since this returns the actual bytes written, it is up to the programmer to check
if truncation happened, but there is no possibility to copy more than `dest_len' - 1.

Based on the above assumptions this can be extended. First instead of size_t to
return ssize_t, so functions can return -1 and set errno accordingly.

Eric Sanchis in his proposal does it a bit different because in his functions adds
an extra argument as size_t, that uses this to control the behavior of the function
(what it will do in the case that destination length is less than source len).

He uses an int as a returned value which either is 0/1 on succesful operation, the
following:
#define   OKNOTRUNC  0		/* copy/concatenation performed without truncation */
#define   OKTRUNC    1		/* copy/concatenation performed with truncation */

And below is the extra information passed as fifth argument:
#define   TRUNC      0		/* truncation allowed */
#define   NOTRUNC    1		/* truncation not allowed */

In the case of an error, returns > 0 which is either:
#define   EDSTPAR   -1		/* Error : bad dst parameters */
#define   ESRCPAR   -2		/* Error : bad src parameters */
#define   EMODPAR   -3		/* Error : bad mode parameter */
#define   ETRUNC    -4		/* Error : not enough space to copy/concatenate
							   and truncation not allowed */

Now combining all this and if the assumptions are correct, gnulib can return
ssize_t and uses this to make it's functions to work up to SIZE_MAX and uses
either Eric's interface or to set errno accordingly.

But to me a function call like:
  str_cp (dest, memsize_of_dest, src, memsize_of_dest - 1)
is quite common C's way to do things, plus we have a way to catch truncation and
not to go out of bounds at the same time.

Of course such operations are tied with malloc().
I've read the gnulib document yesteday and i saw that gnulib wraps malloc() with a
function that (quite logically) aborts execution and even allows to set a callback
function.

In my humble opinion there is also the choise to choose reallocarray() from OpenBSD,
which always checks for integer overflows with the following way:

#define MUL_NO_OVERFLOW ((size_t) 1 << (sizeof (size_t) * 4))
#define MEM_IS_INT_OVERFLOW(nmemb, ssize)                             \
 (((nmemb) >= MUL_NO_OVERFLOW || (ssize) >= MUL_NO_OVERFLOW) &&       \
  (nmemb) > 0 && SIZE_MAX / (nmemb) < (ssize))

Now, you also said to the abovementioned thread:

>> So, what we would need is are functions

    char * substring (const char *string, size_t length);
    char * concatenated_string2 (const char *string1, size_t length1,
                                 const char *string2, size_t length2);
    char * concatenated_string3 (const char *string1, size_t length1,
                                 const char *string2, size_t length2,
                                 const char *string3, size_t length3);
    ...

>> where the length arguments are set to SIZE_MAX to designate the entire
 string.

But exactly this why a string_buffer is preffered in many occations like these,
plus also it has in constant time access to the byte length.

> > An extended ustring (unicode|utf8) type can include information for its bytes with
> > character semantics, like:
> >  (utf8 typedef'ed as signed int)
> >   utf8 code;   // the integer representation
> >   int len;     // the number of the needed bytes
> >   int width;   // the number of the occupied cells
> >   char buf[5]; // and probably the character representation
>
> Such a type would have a niche use, IMO, because
>   - 99% of the processing would not need to access the width (screen columns) - so
>     why spend CPU time and RAM to store it and keep it up-to-date?
>   - 80% of the processing does not care about the Unicode code points either,
>     and libraries like libunistring can do the Unicode-aware processing.

Of course is specialized but it's not uncommon those functions/operations as many
need this information. And i forgot also to include utf8 validation.
In that case as unfortunately there is not a way in C to exclude or include fields
in a structure and since i'm talking here mostly for the functionality, rather a
specific type and since you mentioned libunistring, perhaps would be wise to offer
this functionality in gnulib (like you do for iconv and readline).

But really the level of abstraction that maybe will help C developers is mostly
something (very simplified) like this:

inline long fget_size (FILE *);

implemented (probably) as:

long cur_p = ftell (fp);
fseek (fp, 0, SEEK_END);
size = ftell (fp);
fseek (fp, cur_p, SEEK_SET);
return size;

There is no penalty here, it just will be a common way and expected way to do things.
Maybe then writting and reading code in C will be much more enjoyable and C can be
considered as an expressional language.

But all this needs a standard. Perhaps gnulib can lead those.

> Bruno

Best,
  Αγαθοκλής

¹. https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00004.html