bug-gnulib@gnu.org mirror (unofficial)
* supporting strings > 2 GB
@ 2019-10-12 14:38 Bruno Haible
  2019-10-13  3:01 ` Paul Eggert
  0 siblings, 1 reply; 6+ messages in thread
From: Bruno Haible @ 2019-10-12 14:38 UTC
  To: bug-gnulib

Hi Paul, Eric,

I'd like to get over the INT_MAX limit on string size for
  * the *printf family of functions,
  * the wcswidth, mbswidth functions,
like it has been done for large files and regular expressions.

The benefit I expect from that is:
  - Support of strings > 2 GB or 4 GB without making applications more complex.
  - Since such strings occur rarely, these corner cases of the code are most
    often untested. The change would eliminate these untested corners, thus
    eliminating a number of bugs.

How was it done for regular expressions?
  1) POSIX introduced a type 'regoff_t' that is to be used instead of 'int',
     in the context of the regex APIs.
     https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/regex.h.html
  2) glibc introduced a preprocessor define _REGEX_LARGE_OFFSETS.
  3) gnulib defines _REGEX_LARGE_OFFSETS to 1.
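
A minimal sketch of what item 1 looks like in application code: match
offsets stay in regoff_t and are never narrowed to int. The helper name
match_length is invented for this illustration; whether regoff_t is
actually wider than int depends on the platform and on
_REGEX_LARGE_OFFSETS.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return the length of the first match of PATTERN
   in TEXT, or -1 if there is no match or the pattern does not compile.
   The offsets are kept in regoff_t throughout.  */
static regoff_t
match_length (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  regoff_t len = -1;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&re, text, 1, m, 0) == 0)
    len = m[0].rm_eo - m[0].rm_so;   /* end offset minus start offset */
  regfree (&re);
  return len;
}
```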

In a similar vein, I think it could be done like this for *printf:
  1) Introduce a type 'printf_len_t' that is a signed type, either 'int' or
     'ptrdiff_t'. And a constant PRINTF_LEN_MAX accordingly.
  2) For each *printf function that returns 'int', define a similar function
     *printfl, that returns 'printf_len_t'.
  3) Introduce %ln as a printf_len_t alternative to %n.
  4) If _PRINTF_LARGE is defined and non-zero, define xxxprintf as an alias
     of xxxprintfl (e.g. '#define xxxprintf xxxprintfl').
  5) Gnulib defines _PRINTF_LARGE to 1.

And similarly for wcswidth, with new function wclswidth and macro
_WCSWIDTH_LARGE.

This way, applications could switch from *printf to *printfl at their pace,
without introducing uncaught overflow bugs at any moment.

Has this already been discussed in the Austin Group, or on the glibc list?

Bruno




* Re: supporting strings > 2 GB
  2019-10-12 14:38 supporting strings > 2 GB Bruno Haible
@ 2019-10-13  3:01 ` Paul Eggert
  2019-10-13 17:38   ` Bruno Haible
  2019-10-13 19:50   ` Bruno Haible
  0 siblings, 2 replies; 6+ messages in thread
From: Paul Eggert @ 2019-10-13  3:01 UTC
  To: Bruno Haible; +Cc: bug-gnulib

On 10/12/19 7:38 AM, Bruno Haible wrote:

> Has this already been discussed in the Austin Group, or on the glibc list?

Not as far as I know, though I haven't read all those mailing lists. It would be 
a good thing to do.

I'm not sold on a new type 'printf_len_t' in the standard. Can't we get by with 
using ptrdiff_t instead? That would save standard C libraries the hassle of 
specifying a new length modifier and/or macros like PRIdPRINTF and SCNdPRINTF, 
for programs that want to print or read printf_len_t values.

Gnulib may need something like printf_len_t, PRIdPRINTF etc., but I don't quite 
see why POSIX and/or the C standard would need them.

>>    3) Introduce %ln as a printf_len_t alternative to %n.

Would %ln work only for the new *l functions, or would it also work for the 
already-standard printf functions?

How about the '*' field width? There needs to be some way to say that the field 
width is of type ptrdiff_t, not int. Would '**' stand for ptrdiff_t field widths?

Perhaps it would be simpler if the new *l functions use ptrdiff_t everywhere 
that the old functions use 'int' for sizes and widths. Then we wouldn't have to 
worry about '**' vs '*', or about '%ln' versus '%n'. The Gnulib layer could 
resolve whether the functions are about int or ptrdiff_t.

I assume functions like snprintfl would take ptrdiff_t arguments instead of 
size_t arguments for buffer sizes.

Basically, replace size_t and int with ptrdiff_t everywhere we can.
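
A rough sketch of what such a signature could look like, with ptrdiff_t
for both the buffer size and the result. The name sketch_snprintfl is a
placeholder, not an existing function, and this stand-in merely forwards
to vsnprintf, so it is still subject to the INT_MAX limit; it only
illustrates the types. One side effect of a signed size is that a
negative value can be rejected explicitly.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an snprintf variant that uses ptrdiff_t
   where snprintf uses size_t (buffer size) and int (result).
   Returns -1 for a negative SIZE.  */
static ptrdiff_t
sketch_snprintfl (char *buf, ptrdiff_t size, const char *format, ...)
{
  va_list args;
  int ret;

  if (size < 0)
    return -1;
  va_start (args, format);
  ret = vsnprintf (buf, (size_t) size, format, args);
  va_end (args);
  return ret;   /* widened from int to ptrdiff_t */
}
```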




* Re: supporting strings > 2 GB
  2019-10-13  3:01 ` Paul Eggert
@ 2019-10-13 17:38   ` Bruno Haible
  2019-10-13 18:32     ` Paul Eggert
  2019-10-13 19:50   ` Bruno Haible
  1 sibling, 1 reply; 6+ messages in thread
From: Bruno Haible @ 2019-10-13 17:38 UTC
  To: Paul Eggert; +Cc: bug-gnulib

Hi Paul,

> > Has this already been discussed in the Austin Group, or on the glibc list?
> 
> Not as far as I know, though I haven't read all those mailing lists. It would be 
> a good thing to do.

Thanks for the info. Then, on this topic, gnulib will be going ahead.

> I'm not sold on a new type 'printf_len_t' in the standard. Can't we get by with 
> using ptrdiff_t instead? That would save standard C libraries the hassle of 
> specifying a new length modifier and/or macros like PRIdPRINTF and SCNdPRINTF, 
> for programs that want to print or read printf_len_t values.

The type printf_len_t is meant to allow the user to write code that works with
and without _PRINTF_LARGE.

1) It would be wrong to write

     int ret = printf (...);

   because with _PRINTF_LARGE this code will truncate the printf result.

2) It would be wrong to write

     ptrdiff_t len;
     if (len > PTRDIFF_MAX)
       fail ();

   because without _PRINTF_LARGE this does not do the necessary checking. And

     ptrdiff_t len;
     if (len > INT_MAX)
       fail ();

   is wrong for the case that _PRINTF_LARGE is defined.

The type and macro allow one to write these as

     printf_len_t ret = printf (...);

     printf_len_t len;
     if (len > PRINTF_LEN_MAX)
       fail ();

There is no need to reserve a new length modifier and/or macros like PRIdPRINTF
and SCNdPRINTF, because the type and macro are only a convenience.
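
A sketch of how the type and the macro could be defined and used, under
the assumption that they simply follow _PRINTF_LARGE. Neither
printf_len_t nor PRINTF_LEN_MAX exists in any library today, and
fits_printf_len is a made-up helper. The check is applied to a size_t
value before it is stored in a printf_len_t, so that it is meaningful in
both configurations.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed definitions, following the proposal; not an existing API.  */
#if defined _PRINTF_LARGE && _PRINTF_LARGE
typedef ptrdiff_t printf_len_t;
# define PRINTF_LEN_MAX PTRDIFF_MAX
#else
typedef int printf_len_t;
# define PRINTF_LEN_MAX INT_MAX
#endif

/* Hypothetical helper: return 1 if a result of N bytes can be
   represented as a printf_len_t, 0 otherwise.  */
static int
fits_printf_len (size_t n)
{
  return n <= (size_t) PRINTF_LEN_MAX;
}
```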

> >>    3) Introduce %ln as a printf_len_t alternative to %n.
> 
> Would %ln work only for the new *l functions, or would it also work for the 
> already-standard printf functions?

The existing printf functions are left unchanged: Since the entire result
may not be longer than INT_MAX bytes, it makes no sense to add provisions
for returning an index > INT_MAX or using a format directive with width
or precision > INT_MAX.

> How about the '*' field width? There needs to be some way to say that the field 
> width is of type ptrdiff_t, not int. Would '**' stand for ptrdiff_t field widths?

Good point, yes: there ought to be a way to specify a field width or
precision as a ptrdiff_t. I think I'll prefer the syntax 'l*' to '**',
for consistency with %ln.

> Perhaps it would be simpler if the new *l functions use ptrdiff_t everywhere 
> that the old functions use 'int' for sizes and widths. Then we wouldn't have to 
> worry about '**' vs '*', or about '%ln' versus '%n'. The Gnulib layer could 
> resolve whether the functions are about int or ptrdiff_t.

But then the valid format strings for the *l functions would not be
a superset of the valid format strings for the existing *printf functions.
One of the goals is that programmers can use the new facility just by
importing the respective gnulib modules and doing
  #define _PRINTF_LARGE 1
without reviewing every format string.

> I assume functions like snprintfl would take ptrdiff_t arguments instead of 
> size_t arguments for buffer sizes.
> 
> Basically, replace size_t and int with ptrdiff_t everywhere we can.

Yes, this is the plan; thanks for the reminder about size_t.

Regarding the naming: I'm now tending towards 'lprintf' and 'flprintf',
to make it look like 'wprintf' and 'fwprintf'.

Bruno




* Re: supporting strings > 2 GB
  2019-10-13 17:38   ` Bruno Haible
@ 2019-10-13 18:32     ` Paul Eggert
  0 siblings, 0 replies; 6+ messages in thread
From: Paul Eggert @ 2019-10-13 18:32 UTC
  To: Bruno Haible; +Cc: bug-gnulib

On 10/13/19 10:38 AM, Bruno Haible wrote:

> The type printf_len_t is meant to allow the user to write code that works with
> and without _PRINTF_LARGE.

By "the user" do you mean a user of an improved POSIX API for printf-like 
functions, or a user of a Gnulib wrapper around the improved POSIX API? If the 
former, I'm not quite following. If the latter, then I do follow; but we need to 
make it clear which part of the change is the former and which is for the 
latter, if we ever want to change POSIX and/or ISO C.

> 1) It would be wrong to write
> 
>       int ret = printf (...);
> 
>     because with _PRINTF_LARGE this code will truncate the printf result.

For this particular case, portable code could use 'ptrdiff_t' instead of 'int'; 
this would be portable enough as it would work regardless of whether printf is 
old-style or new-style (except on weird platforms where PTRDIFF_MAX < INT_MAX, 
which I don't think we need to worry about).

> The type and macro allow one to write these as
> 
>       printf_len_t ret = printf (...);
> 
>       printf_len_t len;
>       if (len > PRINTF_LEN_MAX)
>         fail ();

Sorry, I don't follow this. I thought PRINTF_LEN_MAX was intended to be the 
maximum value that can be stored into printf_len_t, in which case 'len > 
PRINTF_LEN_MAX' must yield 0. If the intent is something else, then these types 
and/or macros probably need different names, to avoid confusion with 
longstanding naming practice elsewhere.

> There is no need to reserve a new length modifier and/or macros like PRIdPRINTF
> and SCNdPRINTF, because the type and macro are only a convenience.

So if I want to print a printf_len_t I must first convert it to intmax_t and 
print that? I don't see the convenience here, but perhaps that's because I don't 
understand the intent of printf_len_t and PRINTF_LEN_MAX.
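
For concreteness, the conversion in question would look roughly like
this. The typedef stands in for the proposed type (here assumed to be
ptrdiff_t, i.e. the _PRINTF_LARGE case), and format_printf_len is a
hypothetical helper; only %jd and intmax_t are standard C99.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef ptrdiff_t printf_len_t;   /* assumption: the _PRINTF_LARGE case */

/* Hypothetical helper: format LEN into BUF without a dedicated length
   modifier, by widening it to intmax_t first.
   Returns the result of snprintf.  */
static int
format_printf_len (char *buf, size_t bufsize, printf_len_t len)
{
  return snprintf (buf, bufsize, "%jd", (intmax_t) len);
}
```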

>> Would %ln work only for the new *l functions, or would it also work for the
>> already-standard printf functions?
> 
> The existing printf functions are left unchanged: Since the entire result
> may not be longer than INT_MAX bytes, it makes no sense to add provisions
> for returning an index > INT_MAX or using a format directive with width
> or precision > INT_MAX.

printf already has provisions for width or precision > INT_MAX; one can do 
'printf ("%2147483648d", 0)', for example. These calls are a corner case that 
fail, but that's OK. Attempting to use '**' with old printf could fail in a 
similar way.
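
How an implementation might detect this corner case can be sketched as
follows. The helper width_fits_int is hypothetical and not taken from
any libc; it only shows an overflow check on a literal field width,
accumulating in a wider type so the check itself cannot overflow.

```c
#include <limits.h>

/* Hypothetical helper: DIGITS points at the decimal field width of a
   conversion specification (e.g. "2147483648d").  Return 1 if the
   width fits in int, 0 if it exceeds INT_MAX.  */
static int
width_fits_int (const char *digits)
{
  long long width = 0;

  for (; *digits >= '0' && *digits <= '9'; digits++)
    {
      width = width * 10 + (*digits - '0');
      if (width > INT_MAX)
        return 0;   /* stop early; WIDTH stays well within long long */
    }
  return 1;
}
```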

>> Perhaps it would be simpler if the new *l functions use ptrdiff_t everywhere
>> that the old functions use 'int' for sizes and widths. Then we wouldn't have to
>> worry about '**' vs '*', or about '%ln' versus '%n'. The Gnulib layer could
>> resolve whether the functions are about int or ptrdiff_t.
> 
> But then the valid format strings for the *l functions would not be
> a superset of the valid format strings for the existing *printf functions.

Why a superset? Shouldn't the sets of format strings be the same, so that 
programmers can easily switch back and forth between the two sets of functions? 
For example, if you have code that generates a format string, it would be nicer 
if you could use that same format string regardless of whether you pass it to 
printf or to lprintf.

> One of the goals is that programmers can use the new facility just by
> importing the respective gnulib modules and doing
>    #define _PRINTF_LARGE 1
> without reviewing every format string.

Yes, and that goal is furthered by having the two sets of functions accept the 
same format strings.

> Regarding the naming: I'm now tending towards 'lprintf' and 'flprintf',
> to make it look like 'wprintf' and 'fwprintf'.

Yes, that sounds better than the first proposal.



* Re: supporting strings > 2 GB
  2019-10-13  3:01 ` Paul Eggert
  2019-10-13 17:38   ` Bruno Haible
@ 2019-10-13 19:50   ` Bruno Haible
  2019-10-13 20:12     ` Paul Eggert
  1 sibling, 1 reply; 6+ messages in thread
From: Bruno Haible @ 2019-10-13 19:50 UTC
  To: Paul Eggert; +Cc: bug-gnulib

Hi Paul,

Probably I didn't explain it well. Let me try again.

> Gnulib may need something like printf_len_t, PRIdPRINTF etc., but I don't quite 
> see why POSIX and/or the C standard would need them.

The code will consist of two layers:

1) A layer that defines functions.
   Example:
     ptrdiff_t lprintf (const char *format, ...)
     _GL_ATTRIBUTE_FORMAT_PRINTF (1, 2);

2) A layer that may redefine functions and types through aliases.
   Example:
     #if _PRINTF_LARGE
       #undef printf
       #define printf lprintf
       #define printf_len_t ptrdiff_t
     #else
       #define printf_len_t int
     #endif

This is similar to how the large file support was implemented
in two layers:

1) A function
     off64_t lseek64(int fd, off64_t offset, int whence);

2) A layer that redefines functions and types:

     #if _FILE_OFFSET_BITS == 64
       #define lseek lseek64
       #define off_t off64_t
     #endif

The C or POSIX standards deal only with layer 1). However, layer 2) is
essential for programs, to make the use of the new APIs easy.

Bruno




* Re: supporting strings > 2 GB
  2019-10-13 19:50   ` Bruno Haible
@ 2019-10-13 20:12     ` Paul Eggert
  0 siblings, 0 replies; 6+ messages in thread
From: Paul Eggert @ 2019-10-13 20:12 UTC
  To: Bruno Haible; +Cc: bug-gnulib

On 10/13/19 12:50 PM, Bruno Haible wrote:
> The C or POSIX standards deal only with layer 1). However, layer 2) is
> essential for programs, to make the use of the new APIs easy.

Right, and I see the need for two layers. I'm still not seeing, though, the 
exact division between the two layers in this instance.

With large file support, POSIX took an old function lseek that used 'long', and 
said that lseek should use the new type 'off_t' instead. Old implementations 
could simply add 'typedef long off_t;' and conform. There is no OFF_MAX or 
PRIdOFF because off_t is not part of ISO C. Programs define _FILE_OFFSET_BITS to 
choose which off_t they get.
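
A small sketch of that selection mechanism from the program's side. The
macro has to be defined before the first system header is included; the
helper off_t_bits is made up for this example, and the width it reports
depends on the platform (on systems where off_t is 64 bits wide by
default, the macro changes nothing).

```c
#define _FILE_OFFSET_BITS 64   /* must precede every system header */
#include <limits.h>
#include <sys/types.h>

/* Hypothetical helper: report how many bits the selected off_t has.  */
static int
off_t_bits (void)
{
  return (int) (sizeof (off_t) * CHAR_BIT);
}
```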

A difference here is that we'd be proposing a change to ISO C (it could be done 
only in POSIX, but it's really a change to the C standard). In ISO C there's a 
tradition that types like 'ptrdiff_t' all have macros like PTRDIFF_MAX, PRIdPTR, 
etc., and so presumably this tradition should apply to printf_len_t.

If we take this approach, there should be no need for %ln vs %n or %**d vs %*d; 
programs that define _PRINTF_LARGE will get a wide printf_len_t and things will 
"just work" if programs consistently use printf_len_t instead of int (and use 
the related macros too).



