Re: size_t vs long. - Alejandro Colomar via Libc-alpha

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Alejandro Colomar via Libc-alpha <libc-alpha@sourceware.org>
To: Paul Eggert <eggert@cs.ucla.edu>, A <amit234234234234@gmail.com>,
	libc-alpha@sourceware.org
Cc: gcc@gcc.gnu.org
Subject: Re: size_t vs long.
Date: Fri, 18 Nov 2022 00:04:12 +0100	[thread overview]
Message-ID: <683baaee-f3dc-bc13-c303-8fb0df0d0a36@gmail.com> (raw)
In-Reply-To: <27229b18-673b-d038-9a4c-c32c50ca547c@cs.ucla.edu>

[-- Attachment #1.1: Type: text/plain, Size: 9017 bytes --]

Hi Paul,

On 11/17/22 22:39, Paul Eggert wrote:
>>> Second and more important, that code is bogus. Nobody should ever write code 
>>> like that. If I wrote code like that, I'd *want* a trap.
>>
>> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>>    A[i] = something_nice;
>> }
>>
>> The code above seems a bug by not being used to it.  Once you get used to it, 
>> it can become natural, but let's go for the more natural:
>>
>>
>> for (size_t i = 0; i < sizeof A / sizeof A[0]; ++i) {
>>    A[i] = something_nice;
>> } 
> 
> Those loops do not mean the same thing.

Sorry, I didn't mean that they are the same.  For code that means exactly the 
same as

for (size_t i = 41; i < nitems(A); --i) {
     A[i] = something_nice;
}

we need:

for (size_t i = 0; i < nitems(A) && i <= 41; ++i) {
     A[i] = something_nice;
}

or

for (idx_t i = 0; i < nitems(A) && i <= 41; ++i) {
     A[i] = something_nice;
}

or

for (idx_t i = 41; i < nitems(A) && i > 0; --i) {
     A[i] = something_nice;
}

(always assuming SIZE_MAX > INT_MAX.)

Without more context, I can't know if any of them are bogus.  Of course, that 41 
should normally come from a variable instead of a magic number.

We can see that Jens's code is at least simpler, which is a bonus point.

But normally, we don't enter a loop from a random entry value, but rather one of 
its extremes, which is what I showed in my alternative example.  Let's also show 
the alternative forms we can write it:

for (size_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
}

is equivalent to:

for (size_t i = nitems(A) - 1; i < nitems(A); --i) {
     A[i] = something_nice;
}

or

for (idx_t i = nitems(A) - 1; i > 0; --i) {
     A[i] = something_nice;
}

or

for (idx_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
}

There's not much difference in this case regarding readability.

If 'i' is modified within the loop (apart from the obvious --i_, then we need to 
check both bounds of the array, which with size_t comes for free; with idx_t you 
need to add code:

for (idx_t i = nitems(A) - 1; i < nitems(A) && i > 0; --i) {
     A[i] = something_nice;
     i += foo;
}

or

for (idx_t i = 0; i < nitems(A) && i > 0; ++i) {
     A[i] = something_nice;
     i += foo;
}

vs

for (size_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
     i += foo;
}

or

for (size_t i = nitems(A) - 1; i < nitems(A); --i) {
     A[i] = something_nice;
     i += foo;
}

Again, size_t seems to win in simplicity.

> The first is bogus; the second one is OK 
> (notice, the bogus loop has a "41", the OK loop doesn't).
> 
> I'm not surprised you didn't notice how bogus the first loop was - most people 
> wouldn't notice it either. And it's Gustedt's main point! I don't know why he 
> went off the rails with that overly-clever code, but he did.

I still don't know what was the intended bug.  Or why it would differ from the 
idx_t versions.  Please detail.

> 
> 
>> The main advantage of this code compared to the equivalent ssize_t or 
>> ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and 
>> manage to access the array at [-1], if i is unsigned you'll access [SIZE_MAX], 
>> which will definitely crash your program.
> 
> That's not true on the vast majority of today's platforms, which don't have 
> subscript checking, and for which a[-1] is treated the same way a[SIZE_MAX] is. 
> On my platform (Fedora 36 x86-64) the same machine code is generated for 'a' and 
> 'b' for the following C code.
> 
>    #include <stdint.h>
>    int a(int *p) { return p[-1]; }
>    int b(int *p) { return p[SIZE_MAX]; }

Hmm, this seems to be true in my platform (amd64) per the experiment I just did:

$ cat s.c
#include <sys/types.h>

char
f(char *p, ssize_t i)
{
	return p[i];
}
$ cat u.c
#include <stddef.h>

char
f(char *p, size_t i)
{
	return p[i];
}
$ cc -Wall -Wextra -Werror -S -O3 s.c u.c
$ diff -u u.s s.s
--- u.s	2022-11-17 23:41:47.773805041 +0100
+++ s.s	2022-11-17 23:41:47.761805265 +0100
@@ -1,15 +1,15 @@
-	.file	"u.c"
+	.file	"s.c"
  	.text
  	.p2align 4
  	.globl	f
  	.type	f, @function
  f:
-.LFB0:
+.LFB6:
  	.cfi_startproc
  	movzbl	(%rdi,%rsi), %eax
  	ret
  	.cfi_endproc
-.LFE0:
+.LFE6:
  	.size	f, .-f
  	.ident	"GCC: (Debian 12.2.0-9) 12.2.0"
  	.section	.note.GNU-stack,"",@progbits

It seems a violation of the standard, isn't it?

The operator [] doesn't have a type, and an argument to it should be treated 
with whatever type it has after default promotions.  If I pass a size_t to it, 
the type should be unsigned, and that should be preserved, by accessing the 
array at a high value, which the compiler has no way to know if it will exist or 
not, by that function definition.  The extreme of -1 and SIZE_MAX might be not 
the best one, since we would need a pointer to be 0 to be accessible at 
[SIZE_MAX], but if you replace those by -RANDOM, and (size_t)-RANDOM, then the 
compiler definitely needs to generate different code, yet it doesn't.

I'm guessing this is an optimization by GCC knowing that we will never be close 
to using the whole 64-bit address space.  If we use int and unsigned, things change:

$ cat s.c
char
f(char *p, int i)
{
	return p[i];
}
alx@asus5775:~/tmp$ cat u.c
char
f(char *p, unsigned i)
{
	return p[i];
}
$ cc -Wall -Wextra -Werror -S -O3 s.c u.c
$ diff -u u.s s.s
--- u.s	2022-11-17 23:44:54.446318186 +0100
+++ s.s	2022-11-17 23:44:54.434318409 +0100
@@ -1,4 +1,4 @@
-	.file	"u.c"
+	.file	"s.c"
  	.text
  	.p2align 4
  	.globl	f
@@ -6,7 +6,7 @@
  f:
  .LFB0:
  	.cfi_startproc
-	movl	%esi, %esi
+	movslq	%esi, %rsi
  	movzbl	(%rdi,%rsi), %eax
  	ret
  	.cfi_endproc

I'm guessing that GCC doesn't do the assumption here, and I guess the unsigned 
version would crash, while the signed version would cause nasal demons.  Anyway, 
now that I'm here, I'll test it:

$ cat s.c
[[gnu::noipa]]
char
f(char *p, int i)
{
	return p[i];
}

int main(void)
{
	int i = -1;
	char c[4];

	return f(c, i);
}
$ cc -Wall -Wextra -Werror -O3 s.c
$ ./a.out
$ echo $?
0

$ cat u.c
[[gnu::noipa]]
char
f(char *p, unsigned i)
{
	return p[i];
}

int main(void)
{
	unsigned i = -1;
	char c[4];

	return f(c, i);
}
$ cc -Wall -Wextra -Werror -O3 u.c
$ ./a.out
Segmentation fault

I get this SEGV difference consistently.  I CCed gcc@ in case they consider this 
to be something they want to address.  Maybe the optimization is important for 
size_t-sized indices, but if it is not, I'd prefer getting the SEGV for SIZE_MAX.

> 
> Yes, debugging implementations might catch p[SIZE_MAX], but the ones that do 
> will likely catch p[-1] as well.
> 
> In short, there's little advantage to using size_t for indexes, and there are 
> real disadvantages due to comparison confusion and lack of signed integer 
> overflow checking.
> 
> 
>>> First, Gustedt technically incorrect, because the code *can* trap on 
>>> platforms where SIZE_MAX <= INT_MAX,
> 
>> I honestly don't know of any existing platforms where that is true
> 
> They're a dying breed. The main problem from my point of view is that C and 
> POSIX allow these oddballs, so if you want to write really portable code you 
> have to worry about them - and this understadably discourages people from 
> writing really portable code. (What's the point of coding to the standards if 
> it's just a bunch of make-work?)

I understand your point, since you work on highly-portable code.

But there's always a tradeoff, and I'd very much like that the standards didn't 
allow such oddballs.  They're cleaning up with C23, so it seems to be going in 
the good direction.  I remember this discussion we had with ILP64.  I hope in 
the future those oddballs get reduced considerably.  Luckily, SIZE_MAX>INT_MAX 
seems to be quite more uncommon than ILP64.

> 
> Anyway, one example is Unisys Clearpath C, in which INT_MAX and SIZE_MAX both 
> equal 2**39 - 1.

Lol.  It would have been fun to see it be 2**42 - 1.

> This is allowed by the current POSIX and C standards, and this 
> compiler is still for sale and supported. (I doubt whether they'll port it to 
> C23, so there's that....)

Heh, I now don't remember the name of this compiler that attempted to implement 
UB in the most unexpected ways on purpose just for fun, which was itself an 
interesting experiment to check portability of a given code.  It reminds me of that.

> 
> 
>> C23 will require that signed integers are 2's complement, which I guess 
>> removes the possibility of a trap
> 
> It doesn't remove the possibility, since signed integers can have trap 
> representations. But we are straying from the more important point.
> 

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

next prev parent reply	other threads:[~2022-11-17 23:04 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-17  7:02 size_t vs long A via Libc-alpha
2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
2022-11-17  9:48   ` A via Libc-alpha
2022-11-17 11:00     ` Alejandro Colomar via Libc-alpha
2022-11-17 19:40       ` Jason Duerstock via Libc-alpha
2022-11-17 20:01         ` Alejandro Colomar via Libc-alpha
2022-11-17 19:17   ` Paul Eggert
2022-11-17 20:27     ` Alejandro Colomar via Libc-alpha
2022-11-17 21:39       ` Paul Eggert
2022-11-17 23:04         ` Alejandro Colomar via Libc-alpha [this message]
2022-11-23 20:08           ` Using size_t to crash on off-by-one errors (was: size_t vs long.) Alejandro Colomar via Libc-alpha
2022-11-18  2:11         ` size_t vs long Maciej W. Rozycki
2022-11-18  2:47           ` Paul Eggert
2022-11-23 20:01             ` Alejandro Colomar via Libc-alpha
2022-11-17 21:58 ` DJ Delorie via Libc-alpha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=683baaee-f3dc-bc13-c303-8fb0df0d0a36@gmail.com \
    --to=libc-alpha@sourceware.org \
    --cc=alx.manpages@gmail.com \
    --cc=amit234234234234@gmail.com \
    --cc=eggert@cs.ucla.edu \
    --cc=gcc@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).