unofficial mirror of libc-alpha@sourceware.org
* size_t vs long.
@ 2022-11-17  7:02 A via Libc-alpha
  2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
  2022-11-17 21:58 ` DJ Delorie via Libc-alpha
  0 siblings, 2 replies; 15+ messages in thread
From: A via Libc-alpha @ 2022-11-17  7:02 UTC (permalink / raw)
  To: libc-alpha

Hi,

I prefer long over size_t.

This is because, if the user passes a negative number by mistake, I
can check for it when the type is long, and return immediately.

But if size_t is used, then most probably it will result in a crash:
malloc(-1) will crash the program, because unsigned -1 is
0xFFFFFFFFFFFFFFFF, and this much memory is not available on today's
computers and probably never will be (a RAM size of 2^64 bytes is
really, really huge).

Another thing is that if size_t is used as an array index, then
array[-1] will result in wrong behavior or a program crash. But with
long, the developer can check whether the index is negative, thus
avoiding the crash.
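
For illustration, here is a sketch of the kind of check I mean
(xmalloc_long() is a hypothetical wrapper, not an existing function):

#include <errno.h>
#include <stdlib.h>

void *
xmalloc_long(long size)
{
	if (size < 0) {		/* catch the negative size before it wraps */
		errno = EINVAL;
		return NULL;
	}
	return malloc((size_t) size);
}

The caller gets NULL back with errno set, instead of a crash somewhere
deep inside the program later.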

So, in my opinion, long should be used instead of size_t.

I know that the original glibc authors chose size_t, so there must be
some reason for that; however, that reason is not clear to me.

Amit


* Re: size_t vs long.
  2022-11-17  7:02 size_t vs long A via Libc-alpha
@ 2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
  2022-11-17  9:48   ` A via Libc-alpha
  2022-11-17 19:17   ` Paul Eggert
  2022-11-17 21:58 ` DJ Delorie via Libc-alpha
  1 sibling, 2 replies; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-17  9:21 UTC (permalink / raw)
  To: A, libc-alpha



Hello,

On 11/17/22 08:02, A via Libc-alpha wrote:
> Hi,
> 
> I prefer long over size_t.

'long'?  really?  'ptrdiff_t' could make sense, but 'long' is a very bad choice, 
IMO.  I wish it had never been invented.  What does 'long' mean, in a (ISO C) 
portable way?  Nothing.

'long' is just a type on which typedefs can be made, but it has no use on its 
own, except for a few places in libc where it's used because no one stopped to 
create a better typedef for the variable (e.g., timespec.tv_nsec, which would 
have been better with a typedef called nseconds_t).
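
As a sketch of what I mean (both names below are hypothetical):

/* Hypothetical; nseconds_t is not a real libc typedef. */
typedef long	nseconds_t;

struct hypothetical_timespec {
	long long	sec;	/* seconds */
	nseconds_t	nsec;	/* nanoseconds, [0, 999999999] */
};

The underlying type may well still be 'long', but user code stops
spelling 'long' directly.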

> 
> This is because, if the user passes a negative number by mistake, I
> can check for it when the type is long, and return immediately.

Signed types have their own issues.  In the end, if you pass a negative value to 
malloc(3), it will be converted to a huge value, and sooner or later you will 
probably notice.

> 
> But if size_t is used, then most probably it will result in a crash:

And I love that.  Crashing is the best thing you can do.  That tells me 
immediately that I wrote a bug.  Isn't that what we wanted in the first place?

> malloc(-1) will crash the program, because unsigned -1 is
> 0xFFFFFFFFFFFFFFFF, and this much memory is not available on today's
> computers and probably never will be (a RAM size of 2^64 bytes is
> really, really huge).

We're not so lucky with malloc(3), since it's virtual memory, and you won't get 
it all at once.  But yes, sooner or later, if you passed -1 to malloc(3), you'll 
see a crash, which is a Good Thing (tm).

> 
> Another thing is that if size_t is used as an array index, then
> array[-1] will result in wrong behavior or a program crash. But with
> long, the developer can check whether the index is negative, thus
> avoiding the crash.

And what do you plan to do when you detect -1 in your code?  Set errno to 
EPROGRAMMERNOTSMARTENOUGH and return -1 from your function?

BTW, just for fun, would anyone please add that errno code to glibc?  It would 
be a nice easter egg  :P

If you weren't smart enough to avoid a bug in your code (and of course nobody is 
smart enough to write 0 bugs), can you yourself write code that is smart enough 
to handle it?  It makes little sense.  What if you write another bug in your 
code handling the bug?

> 
> So, in my opinion, long should be used instead of size_t.

I'd like to change your opinion.  Please read this excellent article by Jens 
Gustedt (member of WG14, the group that develops the ISO C standard) which 
explains why size_t is better:

<https://gustedt.wordpress.com/2013/07/15/a-praise-of-size_t-and-other-unsigned-types/>

> 
> I know that the original glibc authors chose size_t, so there must be
> some reason for that; however, that reason is not clear to me.

The reason why it was chosen is not documented, AFAIK, and could possibly be 
attributed to a historical accident.  However, I'd say it's a great accident if 
it was that way.

But in the article above you may find reasons to keep using it for your own code.

> 
> Amit

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>



* Re: size_t vs long.
  2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
@ 2022-11-17  9:48   ` A via Libc-alpha
  2022-11-17 11:00     ` Alejandro Colomar via Libc-alpha
  2022-11-17 19:17   ` Paul Eggert
  1 sibling, 1 reply; 15+ messages in thread
From: A via Libc-alpha @ 2022-11-17  9:48 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: libc-alpha

> >
> > But if size_t is used, then most probably it will result in a crash:
>
> And I love that.  Crashing is the best thing you can do.  That tells me
> immediately that I wrote a bug.  Isn't that what we wanted in the first place?

No, I don't want a crash if I can get an error value returned and
errno set properly.

>
> > malloc(-1) will crash the program, because unsigned -1 is
> > 0xFFFFFFFFFFFFFFFF, and this much memory is not available on today's
> > computers and probably never will be (a RAM size of 2^64 bytes is
> > really, really huge).
>
> We're not so lucky with malloc(3), since it's virtual memory, and you won't get
> it all at once.  But yes, sooner or later, if you passed -1 to malloc(3), you'll
> see a crash, which is a Good Thing (tm).
>

I have been programming for the last 35 years, and I have never heard
the argument that a crash is better than getting an error value
returned and errno set properly. And I have worked for companies like
Cisco Systems and Juniper Networks.

A crash is not a good thing; otherwise, things like checking the
validity/sanity of arguments passed to a function would not exist. If
we all loved crashes, then we would never check any argument passed to
a function. We would simply go ahead without checking the arguments,
let the function crash, and then let the user debug it. Getting an
error value returned and errno set properly is far easier than
debugging a crash. Debugging a crash can take several man-hours, but
getting an error value back and checking errno can resolve the issue
very quickly.
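
To make the contrast concrete, this is the style of error handling I
mean (process() is just a placeholder name):

#include <stdio.h>
#include <stdlib.h>

int
process(size_t n)
{
	char *p = malloc(n);

	if (p == NULL) {		/* errno is already set by malloc */
		perror("malloc");	/* report and recover; no crash */
		return -1;
	}
	/* ... use p ... */
	free(p);
	return 0;
}

One look at errno tells me what went wrong; a core dump does not.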

> >
> > Another thing is that if size_t is used an array index then array[-1]
> > will result in wrong behavior or program crash. But with long, the
> > developer can check whether the index is negative, thus avoiding
> > program crash.
>
> And what do you plan to do when you detect -1 in your code?  Set errno to
> EPROGRAMMERNOTSMARTENOUGH and return -1 from your function?

Looks like you are trying to make fun of me. I don't appreciate this.
However, you could set errno to something like "ENEGATIVESUBSCRIPT".

Anyway, to shorten the discussion and keep it to the point, I would
like to know why size_t is used in malloc() when a negative value
(passed by the user by mistake) can crash the program. Using long and
checking for negative values can prevent the program from crashing.

Some people might say that the user should check for the value being
negative, but checking for negative values requires long, not size_t.
So the user will end up using long instead of size_t. So, in effect,
using size_t in malloc() is not correct (unless we get further insight
that explains why size_t is correct in malloc()).

Just saying that size_t is good and long is bad (in malloc()) without
giving any reasons does not make sense.

Amit


* Re: size_t vs long.
  2022-11-17  9:48   ` A via Libc-alpha
@ 2022-11-17 11:00     ` Alejandro Colomar via Libc-alpha
  2022-11-17 19:40       ` Jason Duerstock via Libc-alpha
  0 siblings, 1 reply; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-17 11:00 UTC (permalink / raw)
  To: A; +Cc: libc-alpha



Hello,

On 11/17/22 10:48, A wrote:
>>>
>>> But if size_t is used, then most probably it will result in a crash:
>>
>> And I love that.  Crashing is the best thing you can do.  That tells me
>> immediately that I wrote a bug.  Isn't that what we wanted in the first place?
> 
> No, I don't want a crash if I can get an error value returned and
> errno set properly.
> 
>>
>>> malloc(-1) will crash the program, because unsigned -1 is
>>> 0xFFFFFFFFFFFFFFFF, and this much memory is not available on today's
>>> computers and probably never will be (a RAM size of 2^64 bytes is
>>> really, really huge).
>>
>> We're not so lucky with malloc(3), since it's virtual memory, and you won't get
>> it all at once.  But yes, sooner or later, if you passed -1 to malloc(3), you'll
>> see a crash, which is a Good Thing (tm).
>>
> 
> I have been programming for the last 35 years, and I have never heard
> the argument that a crash is better than getting an error value
> returned and errno set properly. And I have worked for companies like
> Cisco Systems and Juniper Networks.

Returning error codes is for when the input is wrong but the program logic is 
OK.  When the program logic is wrong, the behaviour of the program is by 
necessity undefined.  And ISO C (in Annex L) defines two types of Undefined 
Behaviour: bounded UB, and critical UB.  Bounded UB basically means that you don't 
overwrite any files in your system, or otherwise modify the state of your 
system.  Critical UB means anything else, including wiping your hard drive, and 
demons flying out of your nose [1].

[1]: (nasal demons)
  <https://www.catb.org/jargon/html/N/nasal-demons.html>
 
<https://stackoverflow.com/questions/32132574/does-undefined-behavior-really-permit-anything-to-happen>
 
<https://stackoverflow.com/questions/13444690/is-passing-additional-parameters-through-function-pointer-legal-defined-in-c/13444785#13444785>


Crashing your program on bounded UB means that you prevent continuing with the 
broken program logic.  Continuing with broken program logic would very likely 
result in critical undefined behavior, and it could cost millions of dollars, 
depending on how unlucky you are.

> 
> A crash is not a good thing; otherwise, things like checking the
> validity/sanity of arguments passed to a function would not
> exist.

Checking validity of arguments is for user input.  It also makes sense for 
static analysis.  Checking at runtime also makes sense if you just want to debug 
an existing program, but you shouldn't modify the program for that (or you would 
be debugging a different program).

Adding logic for runtime checks that your program logic is correct makes no 
sense at all.

> If we all loved crashes, then we would never check any
> argument passed to a function.

strlcpy(3) is designed to crash your program if the input is not 
NUL-terminated.  The function was designed in OpenBSD, which takes security very 
seriously.  Just an example.
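
A sketch of what I mean (strlcpy(3) here comes from libbsd on glibc
systems; the missing terminator is deliberate):

#include <bsd/string.h>		/* link with -lbsd on glibc */

int
main(void)
{
	char dst[16];
	char src[4] = {'a', 'b', 'c', 'd'};	/* no NUL terminator */

	strlcpy(dst, src, sizeof(dst));	/* scans src for a NUL it doesn't
					   have; reads out of bounds, and
					   sooner or later crashes */
}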

> We would simply go ahead without
> checking the arguments, let the function crash, and then let the
> user debug it. Getting an error value returned and errno set
> properly is far easier than debugging a crash. Debugging a crash
> can take several man-hours, but getting an error value back and
> checking errno can resolve the issue very quickly.

And then what?  Continue with a program that has already proven to have dubious 
logic?  If I detect that error, the next thing I'll call is abort(3).  I prefer 
that the kernel kill my program, so that I write less code.

> 
>>>
>>> Another thing is that if size_t is used as an array index, then
>>> array[-1] will result in wrong behavior or a program crash. But with
>>> long, the developer can check whether the index is negative, thus
>>> avoiding the crash.
>>
>> And what do you plan to do when you detect -1 in your code?  Set errno to
>> EPROGRAMMERNOTSMARTENOUGH and return -1 from your function?
> 
> Looks like you are trying to make fun of me.

Not really.  I was trying to be explanatory, while being a bit funny, to make it 
more entertaining to read.  But as Brian Kernighan said, no one is smart enough 
to debug their own code:

[
Debugging is twice as hard as writing the code in the first place. Therefore, if 
you write the code as cleverly as possible, you are, by definition, not smart 
enough to debug it.
]
         — Brian W. Kernighan and P. J. Plauger in The Elements of Programming 
Style.

<http://quotes.cat-v.org/programming/#bwk>

> I don't appreciate this.
> However, you could set errno to something like "ENEGATIVESUBSCRIPT".

And then, abort(3).

> 
> Anyway, to shorten the discussion and keep it to the point, I would
> like to know why size_t is used in malloc() when a negative value
> (passed by the user by mistake) can crash the program. Using long and
> checking for negative values can prevent the program from crashing.

I already said it's probably a historical accident.

> 
> Some people might say that the user should check for the value being
> negative, but checking for negative values requires long, not size_t.
> So the user will end up using long instead of size_t.

You could check against a suitably high limit with size_t; Annex K calls it 
RSIZE_MAX.  It's the same check in the end.
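
As a sketch (malloc_checked() is a hypothetical wrapper; glibc doesn't
provide Annex K's RSIZE_MAX, so SIZE_MAX / 2 stands in for it):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

void *
malloc_checked(size_t size)
{
	if (size > SIZE_MAX / 2) {	/* stand-in for RSIZE_MAX */
		errno = EINVAL;		/* almost certainly a negative
					   value converted to size_t */
		return NULL;
	}
	return malloc(size);
}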

> So, in effect,
> using size_t in malloc() is not correct (unless we get further insight
> that explains why size_t is correct in malloc()).
> 
> Just saying that size_t is good and long is bad (in malloc()) without
> giving any reasons does not make sense.

I pointed to an excellent article by Jens Gustedt.  Please read it.

> 
> Amit


Alex

-- 
<http://www.alejandro-colomar.es/>



* Re: size_t vs long.
  2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
  2022-11-17  9:48   ` A via Libc-alpha
@ 2022-11-17 19:17   ` Paul Eggert
  2022-11-17 20:27     ` Alejandro Colomar via Libc-alpha
  1 sibling, 1 reply; 15+ messages in thread
From: Paul Eggert @ 2022-11-17 19:17 UTC (permalink / raw)
  To: Alejandro Colomar, A, libc-alpha

On 2022-11-17 01:21, Alejandro Colomar via Libc-alpha wrote:

> I'd like to change your opinion.  Please read this excellent article by 
> Jens Gustedt (member of WG14, the group that develops the ISO C 
> standard) which explains why size_t is better:
> 
> <https://gustedt.wordpress.com/2013/07/15/a-praise-of-size_t-and-other-unsigned-types/>

Sorry, but that article is not excellent: it's mostly wrong. Among other 
things it says size_t is better because it lets you write code like this:

> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>    A[i] = something_nice;
> }

and that there will be "No traps, no signals, no exceptions".

First, Gustedt is technically incorrect, because the code *can* trap on 
platforms where SIZE_MAX <= INT_MAX: on such a platform, when i 
is zero, '--i' can store a trap value into i.

Second and more important, that code is bogus. Nobody should ever write 
code like that. If I wrote code like that, I'd *want* a trap. Traps are 
*good* when they prevent buggy code from doing further damage.

For what it's worth, in Gnulib's more recent code we've been using the 
type "idx_t". It is a signed type, thus avoiding C's bug-inducing 
comparison rules, under which most size_t values compare less than -1. 
However, by convention idx_t contains only nonnegative values.

The idx_t type is *much* better than size_t, both because we can tell 
the compiler to do some overflow checking on it, and because it compares 
nicely to ordinary integers. This overcomes two major disadvantages of 
size_t.
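
For reference, a minimal sketch of the idea (Gnulib's actual idx.h has
more machinery around it):

   #include <stddef.h>
   typedef ptrdiff_t idx_t;  /* signed, but holds only values >= 0 */

Because the type is signed, -fsanitize=undefined can trap on overflow,
and "i < n" keeps meaning what it looks like when n is an ordinary int.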


* Re: size_t vs long.
  2022-11-17 11:00     ` Alejandro Colomar via Libc-alpha
@ 2022-11-17 19:40       ` Jason Duerstock via Libc-alpha
  2022-11-17 20:01         ` Alejandro Colomar via Libc-alpha
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Duerstock via Libc-alpha @ 2022-11-17 19:40 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: A, GNU C Library

On Thu, Nov 17, 2022 at 6:01 AM Alejandro Colomar via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Not really.  I was trying to be explanatory, while being a bit funny, to make it
> more entertaining to read.  But as Brian Kernighan said, no one is smart enough
> to debug their own code:
>
> [
> Debugging is twice as hard as writing the code in the first place. Therefore, if
> you write the code as cleverly as possible, you are, by definition, not smart
> enough to debug it.
> ]
>          — Brian W. Kernighan and P. J. Plauger in The Elements of Programming
> Style.

Not to get too far off topic, but I took this to mean that one should
endeavor to be boring and straightforward when coding unless it was
absolutely necessary to be clever, as the boring and straightforward
code would be easier to debug.

Or at the very least, one should never be more than 50% clever about it.

Jason


* Re: size_t vs long.
  2022-11-17 19:40       ` Jason Duerstock via Libc-alpha
@ 2022-11-17 20:01         ` Alejandro Colomar via Libc-alpha
  0 siblings, 0 replies; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-17 20:01 UTC (permalink / raw)
  To: Jason Duerstock; +Cc: A, GNU C Library



Hi Jason,

On 11/17/22 20:40, Jason Duerstock wrote:
> On Thu, Nov 17, 2022 at 6:01 AM Alejandro Colomar via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> Not really.  I was trying to be explanatory, while being a bit funny, to make it
>> more entertaining to read.  But as Brian Kernighan said, no one is smart enough
>> to debug their own code:
>>
>> [
>> Debugging is twice as hard as writing the code in the first place. Therefore, if
>> you write the code as cleverly as possible, you are, by definition, not smart
>> enough to debug it.
>> ]
>>           — Brian W. Kernighan and P. J. Plauger in The Elements of Programming
>> Style.
> 
> Not to get too far off topic, but I took this to mean that one should
> endeavor to be boring and straightforward when coding unless it was
> absolutely necessary to be clever, as the boring and straightforward
> code would be easier to debug.

Yes, that's a valid reading of it.

My extended reading of it is:

-  Code is hard to debug by yourself; keep it simple (your point, I think).
-  If you write more complex code trying to debug your own code, you are by 
necessity violating the previous point.
-  If you went ahead and wrote code to debug your own code, will you go further 
to write code that debugs the debugging code?
-  The algorithm crashed.  :)

> 
> Or at the very least, one should never be more than 50% clever about it.
> 
> Jason

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>



* Re: size_t vs long.
  2022-11-17 19:17   ` Paul Eggert
@ 2022-11-17 20:27     ` Alejandro Colomar via Libc-alpha
  2022-11-17 21:39       ` Paul Eggert
  0 siblings, 1 reply; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-17 20:27 UTC (permalink / raw)
  To: Paul Eggert, A, libc-alpha



Hi Paul,

On 11/17/22 20:17, Paul Eggert wrote:
> On 2022-11-17 01:21, Alejandro Colomar via Libc-alpha wrote:
> 
>> I'd like to change your opinion.  Please read this excellent article by Jens 
>> Gustedt (member of WG14, the group that develops the ISO C standard) which 
>> explains why size_t is better:
>>
>> <https://gustedt.wordpress.com/2013/07/15/a-praise-of-size_t-and-other-unsigned-types/>
> 
> Sorry, but that article is not excellent: it's mostly wrong. Among other things 
> it says size_t is better because it lets you write code like this:
> 
>> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>>    A[i] = something_nice;
>> }
> 
> and that there will be "No traps, no signals, no exceptions".
> 
> First, Gustedt is technically incorrect, because the code *can* trap on platforms 
> where SIZE_MAX <= INT_MAX,

First of all, let me suggest that this is not a problem of that kind of code, 
but rather a bug in the language: default promotion to int is the underlying 
problem, and the root of much evil.

But let's continue.  SIZE_MAX <= INT_MAX really amounts to platforms where 
sizeof(size_t) < sizeof(int).

I honestly don't know of any existing platforms where that is true, and I've 
been searching, but couldn't find any.  I expect that it's possible that one of 
those very old unicorn platforms may make this true, and if you know of any, I'm 
curious to know which it is.

For future platforms, since we've learnt that we want size_t to be at least 64 
bits, I guess this can happen in a hypothetical platform where size_t is 64 
bits, and int is 128.  I hope we've also learnt that default promotion to int is 
bad.  And so I hope that no one develops such an arch, and that we all do our 
best to try and minimize the damage that default promotion to int can do, by 
keeping int smaller than most useful sizes, be it by increasing size_t, or by 
not increasing int.

For the time being, and while no one points me to an existing platform where 
sizeof(size_t) < sizeof(int) (and even if it exists, I only care about POSIX 
platforms, where we can probably assume that's not going to happen ever, if only 
for not breaking existing code), I'll assume such a platform doesn't exist.

If this ever becomes a real concern:

_Static_assert(sizeof(size_t) >= sizeof(int), "This platform is out of luck.");

> because on such a platform when i is zero, '--i' can 
> store a trap value into i.

So many things need to be broken in an arch for that to happen.  BTW, C23 will 
require that signed integers are 2's complement, which I guess removes the 
possibility of a trap, IIRC.  But I, like you, prefer the trap if I meet an arch 
where I get promotion from size_t to int.

> 
> Second and more important, that code is bogus. Nobody should ever write code 
> like that. If I wrote code like that, I'd *want* a trap.

for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
    A[i] = something_nice;
}

The code above seems like a bug only because we're not used to it.  Once you get 
used to it, it can become natural, but let's go for the more natural:


for (size_t i = 0; i < sizeof A / sizeof A[0]; ++i) {
    A[i] = something_nice;
}

The main advantage of this code compared to the equivalent ssize_t or ptrdiff_t 
or idx_t code is that if you somehow write an off-by-one error, and manage to 
access the array at [-1], if i is unsigned you'll access [SIZE_MAX], which will 
definitely crash your program.  An access to [-1] might instead overwrite some 
valuable data.  This is an important point for unsigned types.

> Traps are *good* when 
> they prevent buggy code from doing further damage.

We seem to agree on this sentence.  It's actually the main reason I like size_t 
for indices, as explained in my paragraph above.

> 
> For what it's worth, in Gnulib's more recent code we've been using the type 
> "idx_t". It is a signed type, thus avoiding C's bug-inducing comparison rules, 
> under which most size_t values compare less than -1.
> However, by convention idx_t contains only nonnegative values.

I agree that one should try to avoid comparing signed and unsigned integers. 
That's actually very doable.  The main issue against using unsigned indices is 
'argc', where I can't use unsigned.  For the rest of the code, it is usually 
easy to keep the separation between signed and unsigned types, and not mix them.

> 
> The idx_t type is *much* better than size_t, both because we can tell the 
> compiler to do some overflow checking on it, and because it compares nicely to 
> ordinary integers. This overcomes two major disadvantages of size_t.

Ignoring odd platforms, idx_t loses much of its advantage.  With idx_t you may 
be able to use sanitizers to check overflow (if you don't, it's likely that 
you'll invoke critical UB, since [-1] is unlikely to crash).  With size_t you 
get a crash for free if you go off-by-minus-one.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>



* Re: size_t vs long.
  2022-11-17 20:27     ` Alejandro Colomar via Libc-alpha
@ 2022-11-17 21:39       ` Paul Eggert
  2022-11-17 23:04         ` Alejandro Colomar via Libc-alpha
  2022-11-18  2:11         ` size_t vs long Maciej W. Rozycki
  0 siblings, 2 replies; 15+ messages in thread
From: Paul Eggert @ 2022-11-17 21:39 UTC (permalink / raw)
  To: Alejandro Colomar, A, libc-alpha

>> Second and more important, that code is bogus. Nobody should ever write code like that. If I wrote code like that, I'd *want* a trap.
> 
> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>    A[i] = something_nice;
> }
> 
> The code above seems like a bug only because we're not used to it.  Once you get used to it, it can become natural, but let's go for the more natural:
> 
> 
> for (size_t i = 0; i < sizeof A / sizeof A[0]; ++i) {
>    A[i] = something_nice;
> } 

Those loops do not mean the same thing. The first is bogus; the second 
one is OK (notice, the bogus loop has a "41", the OK loop doesn't).

I'm not surprised you didn't notice how bogus the first loop was - most 
people wouldn't notice it either. And it's Gustedt's main point! I don't 
know why he went off the rails with that overly-clever code, but he did.


> The main advantage of this code compared to the equivalent ssize_t or ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and manage to access the array at [-1], if i is unsigned you'll access [SIZE_MAX], which will definitely crash your program.

That's not true on the vast majority of today's platforms, which don't 
have subscript checking, and for which a[-1] is treated the same way 
a[SIZE_MAX] is. On my platform (Fedora 36 x86-64) the same machine code 
is generated for 'a' and 'b' for the following C code.

   #include <stdint.h>
   int a(int *p) { return p[-1]; }
   int b(int *p) { return p[SIZE_MAX]; }

Yes, debugging implementations might catch p[SIZE_MAX], but the ones 
that do will likely catch p[-1] as well.

In short, there's little advantage to using size_t for indexes, and 
there are real disadvantages due to comparison confusion and lack of 
signed integer overflow checking.
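
The comparison confusion is easy to demonstrate:

   #include <stdio.h>
   int main(void)
   {
     size_t n = 1;
     if (-1 < n)                        /* -1 converts to SIZE_MAX first */
       puts("what you might expect");
     else
       puts("what actually happens");   /* this branch runs */
   }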


>> First, Gustedt is technically incorrect, because the code *can* trap on 
>> platforms where SIZE_MAX <= INT_MAX,

> I honestly don't know of any existing platforms where that is true

They're a dying breed. The main problem from my point of view is that C 
and POSIX allow these oddballs, so if you want to write really portable 
code you have to worry about them - and this understandably discourages 
people from writing really portable code. (What's the point of coding to 
the standards if it's just a bunch of make-work?)

Anyway, one example is Unisys Clearpath C, in which INT_MAX and SIZE_MAX 
both equal 2**39 - 1. This is allowed by the current POSIX and C 
standards, and this compiler is still for sale and supported. (I doubt 
whether they'll port it to C23, so there's that....)


> C23 will require that signed integers are 2's complement, which I guess 
> removes the possibility of a trap

It doesn't remove the possibility, since signed integers can have trap 
representations. But we are straying from the more important point.



* Re: size_t vs long.
  2022-11-17  7:02 size_t vs long A via Libc-alpha
  2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
@ 2022-11-17 21:58 ` DJ Delorie via Libc-alpha
  1 sibling, 0 replies; 15+ messages in thread
From: DJ Delorie via Libc-alpha @ 2022-11-17 21:58 UTC (permalink / raw)
  To: A; +Cc: libc-alpha

A via Libc-alpha <libc-alpha@sourceware.org> writes:
> I prefer long over size_t.

On many platforms, long and size_t are not the same size (in bits).  On
many 16-bit platforms, long is 32 bits and size_t (and all pointers) are
16.  On some platforms (LLP64, such as 64-bit Windows), long is only 32
bits and size_t is 64.  In the x32 ABI, the CPU registers are 64 bits,
but long and size_t are both only 32.  It's because these cases exist
that there are separate types for numbers vs pointers vs sizes.
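
A quick way to see what your own platform does:

#include <limits.h>
#include <stddef.h>
#include <stdio.h>

int
main(void)
{
	printf("long:   %zu bits\n", sizeof(long) * CHAR_BIT);
	printf("size_t: %zu bits\n", sizeof(size_t) * CHAR_BIT);
	printf("void *: %zu bits\n", sizeof(void *) * CHAR_BIT);
}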

> This is because, if the user passes a negative number by mistake, I
> can check for it when the type is long, and return immediately.

You want ssize_t then.  It's the same size as size_t, but signed.  For
math on pointers, use ptrdiff_t.  There's even an intptr_t (and
uintptr_t) that is whatever integer type is the same size as a pointer.
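
read(2) is the classic example of why ssize_t exists; a sketch
(read_or_report() is just an illustrative name):

#include <stdio.h>
#include <unistd.h>

/* -1 signals an error (with errno set); nonnegative values are
   byte counts. */
ssize_t
read_or_report(int fd, void *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	if (n == -1)
		perror("read");
	return n;
}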

> But if size_t is used, then most probably it will result in a crash:
> malloc(-1) will crash the program, because unsigned -1 is

malloc should not accept a size greater than half the address space (as
per the spec), so it won't matter whether the parameter is signed or
unsigned.
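
On glibc, for example, the check amounts to something like this
(a simplified sketch, not the actual implementation):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

void *
malloc_sketch(size_t size)
{
	if (size > PTRDIFF_MAX) {	/* more than half the address space */
		errno = ENOMEM;
		return NULL;
	}
	return malloc(size);
}

So malloc(-1), i.e. malloc(SIZE_MAX), fails cleanly with ENOMEM whether
the parameter type is signed or unsigned.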



* Re: size_t vs long.
  2022-11-17 21:39       ` Paul Eggert
@ 2022-11-17 23:04         ` Alejandro Colomar via Libc-alpha
  2022-11-23 20:08           ` Using size_t to crash on off-by-one errors (was: size_t vs long.) Alejandro Colomar via Libc-alpha
  2022-11-18  2:11         ` size_t vs long Maciej W. Rozycki
  1 sibling, 1 reply; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-17 23:04 UTC (permalink / raw)
  To: Paul Eggert, A, libc-alpha; +Cc: gcc



Hi Paul,

On 11/17/22 22:39, Paul Eggert wrote:
>>> Second and more important, that code is bogus. Nobody should ever write code 
>>> like that. If I wrote code like that, I'd *want* a trap.
>>
>> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>>    A[i] = something_nice;
>> }
>>
>> The code above seems like a bug only because we're not used to it.  Once you 
>> get used to it, it can become natural, but let's go for the more natural:
>>
>>
>> for (size_t i = 0; i < sizeof A / sizeof A[0]; ++i) {
>>    A[i] = something_nice;
>> } 
> 
> Those loops do not mean the same thing.

Sorry, I didn't mean that they are the same.  For code that means exactly the 
same as

for (size_t i = 41; i < nitems(A); --i) {
     A[i] = something_nice;
}


we need:


for (size_t i = 0; i < nitems(A) && i <= 41; ++i) {
     A[i] = something_nice;
}

or

for (idx_t i = 0; i < nitems(A) && i <= 41; ++i) {
     A[i] = something_nice;
}

or

for (idx_t i = 41; i < nitems(A) && i >= 0; --i) {
     A[i] = something_nice;
}

(always assuming SIZE_MAX > INT_MAX.)


Without more context, I can't know if any of them are bogus.  Of course, that 41 
should normally come from a variable instead of a magic number.

We can see that Jens's code is at least simpler, which is a bonus point.

But normally, we don't enter a loop from a random entry value, but rather one of 
its extremes, which is what I showed in my alternative example.  Let's also show 
the alternative forms we can write it:


for (size_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
}


is equivalent to:


for (size_t i = nitems(A) - 1; i < nitems(A); --i) {
     A[i] = something_nice;
}

or

for (idx_t i = nitems(A) - 1; i >= 0; --i) {
     A[i] = something_nice;
}

or

for (idx_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
}


There's not much difference in this case regarding readability.

If 'i' is modified within the loop (apart from the obvious --i), then we need to 
check both bounds of the array, which with size_t comes for free; with idx_t you 
need to add code:



for (idx_t i = nitems(A) - 1; i < nitems(A) && i >= 0; --i) {
     A[i] = something_nice;
     i += foo;
}

or

for (idx_t i = 0; i < nitems(A) && i >= 0; ++i) {
     A[i] = something_nice;
     i += foo;
}

vs

for (size_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
     i += foo;
}

or

for (size_t i = nitems(A) - 1; i < nitems(A); --i) {
     A[i] = something_nice;
     i += foo;
}


Again, size_t seems to win in simplicity.


> The first is bogus; the second one is OK 
> (notice, the bogus loop has a "41", the OK loop doesn't).
> 
> I'm not surprised you didn't notice how bogus the first loop was - most people 
> wouldn't notice it either. And it's Gustedt's main point! I don't know why he 
> went off the rails with that overly-clever code, but he did.

I still don't know what the intended bug was.  Or why it would differ from the 
idx_t versions.  Please detail.

> 
> 
>> The main advantage of this code compared to the equivalent ssize_t or 
>> ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and 
>> manage to access the array at [-1], if i is unsigned you'll access [SIZE_MAX], 
>> which will definitely crash your program.
> 
> That's not true on the vast majority of today's platforms, which don't have 
> subscript checking, and for which a[-1] is treated the same way a[SIZE_MAX] is. 
> On my platform (Fedora 36 x86-64) the same machine code is generated for 'a' and 
> 'b' for the following C code.
> 
>    #include <stdint.h>
>    int a(int *p) { return p[-1]; }
>    int b(int *p) { return p[SIZE_MAX]; }

Hmm, this seems to be true on my platform (amd64) per the experiment I just did:

$ cat s.c
#include <sys/types.h>

char
f(char *p, ssize_t i)
{
	return p[i];
}
$ cat u.c
#include <stddef.h>

char
f(char *p, size_t i)
{
	return p[i];
}
$ cc -Wall -Wextra -Werror -S -O3 s.c u.c
$ diff -u u.s s.s
--- u.s	2022-11-17 23:41:47.773805041 +0100
+++ s.s	2022-11-17 23:41:47.761805265 +0100
@@ -1,15 +1,15 @@
-	.file	"u.c"
+	.file	"s.c"
  	.text
  	.p2align 4
  	.globl	f
  	.type	f, @function
  f:
-.LFB0:
+.LFB6:
  	.cfi_startproc
  	movzbl	(%rdi,%rsi), %eax
  	ret
  	.cfi_endproc
-.LFE0:
+.LFE6:
  	.size	f, .-f
  	.ident	"GCC: (Debian 12.2.0-9) 12.2.0"
  	.section	.note.GNU-stack,"",@progbits


It seems like a violation of the standard, doesn't it?

The operator [] doesn't have a type, and an argument to it should be treated 
with whatever type it has after default promotions.  If I pass a size_t to it, 
the type should be unsigned, and that should be preserved, by accessing the 
array at a high value, which the compiler has no way of knowing exists or not 
from that function definition alone.  The extreme of -1 and SIZE_MAX might not 
be the best example, since we would need a pointer to be 0 to be accessible at 
[SIZE_MAX], but if you replace those by -RANDOM and (size_t)-RANDOM, then the 
compiler definitely needs to generate different code, yet it doesn't.

I'm guessing this is an optimization by GCC knowing that we will never be close 
to using the whole 64-bit address space.  If we use int and unsigned, things change:

$ cat s.c
char
f(char *p, int i)
{
	return p[i];
}
$ cat u.c
char
f(char *p, unsigned i)
{
	return p[i];
}
$ cc -Wall -Wextra -Werror -S -O3 s.c u.c
$ diff -u u.s s.s
--- u.s	2022-11-17 23:44:54.446318186 +0100
+++ s.s	2022-11-17 23:44:54.434318409 +0100
@@ -1,4 +1,4 @@
-	.file	"u.c"
+	.file	"s.c"
  	.text
  	.p2align 4
  	.globl	f
@@ -6,7 +6,7 @@
  f:
  .LFB0:
  	.cfi_startproc
-	movl	%esi, %esi
+	movslq	%esi, %rsi
  	movzbl	(%rdi,%rsi), %eax
  	ret
  	.cfi_endproc


I'm guessing that GCC doesn't make the assumption here, and I guess the unsigned 
version would crash, while the signed version would cause nasal demons.  Anyway, 
now that I'm here, I'll test it:


$ cat s.c
[[gnu::noipa]]
char
f(char *p, int i)
{
	return p[i];
}

int main(void)
{
	int i = -1;
	char c[4];

	return f(c, i);
}
$ cc -Wall -Wextra -Werror -O3 s.c
$ ./a.out
$ echo $?
0


$ cat u.c
[[gnu::noipa]]
char
f(char *p, unsigned i)
{
	return p[i];
}

int main(void)
{
	unsigned i = -1;
	char c[4];

	return f(c, i);
}
$ cc -Wall -Wextra -Werror -O3 u.c
$ ./a.out
Segmentation fault


I get this SEGV difference consistently.  I CCed gcc@ in case they consider this 
to be something they want to address.  Maybe the optimization is important for 
size_t-sized indices, but if it is not, I'd prefer getting the SEGV for SIZE_MAX.

> 
> Yes, debugging implementations might catch p[SIZE_MAX], but the ones that do 
> will likely catch p[-1] as well.
> 
> In short, there's little advantage to using size_t for indexes, and there are 
> real disadvantages due to comparison confusion and lack of signed integer 
> overflow checking.
> 
> 
>>> First, Gustedt is technically incorrect, because the code *can* trap on 
>>> platforms where SIZE_MAX <= INT_MAX,
> 
>> I honestly don't know of any existing platforms where that is true
> 
> They're a dying breed. The main problem from my point of view is that C and 
> POSIX allow these oddballs, so if you want to write really portable code you 
> have to worry about them - and this understandably discourages people from 
> writing really portable code. (What's the point of coding to the standards if 
> it's just a bunch of make-work?)

I understand your point, since you work on highly-portable code.

But there's always a tradeoff, and I'd very much like the standards not to 
allow such oddballs.  They're cleaning up with C23, so things seem to be going in 
the good direction.  I remember this discussion we had with ILP64.  I hope in 
the future those oddballs get reduced considerably.  Luckily, SIZE_MAX <= INT_MAX 
seems to be much less common than ILP64.

> 
> Anyway, one example is Unisys Clearpath C, in which INT_MAX and SIZE_MAX both 
> equal 2**39 - 1.

Lol.  It would have been fun to see it be 2**42 - 1.

> This is allowed by the current POSIX and C standards, and this 
> compiler is still for sale and supported. (I doubt whether they'll port it to 
> C23, so there's that....)

Heh, I can't remember now the name of that compiler that deliberately implemented 
UB in the most unexpected ways, just for fun, which was itself an interesting 
experiment for checking the portability of a given piece of code.  This reminds me of it.

> 
> 
>> C23 will require that signed integers are 2's complement, which I guess 
>> removes the possibility of a trap
> 
> It doesn't remove the possibility, since signed integers can have trap 
> representations. But we are straying from the more important point.
> 

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>



* Re: size_t vs long.
  2022-11-17 21:39       ` Paul Eggert
  2022-11-17 23:04         ` Alejandro Colomar via Libc-alpha
@ 2022-11-18  2:11         ` Maciej W. Rozycki
  2022-11-18  2:47           ` Paul Eggert
  1 sibling, 1 reply; 15+ messages in thread
From: Maciej W. Rozycki @ 2022-11-18  2:11 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Alejandro Colomar, A, libc-alpha

On Thu, 17 Nov 2022, Paul Eggert wrote:

> > > Second and more important, that code is bogus. Nobody should ever write
> > > code like that. If I wrote code like that, I'd *want* a trap.
> > 
> > for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
> >    A[i] = something_nice;
> > }
> > 
> > The code above seems like a bug only because we're not used to it.  Once you
> > get used to it, it can become natural, but let's go for the more natural:
> > 
> > 
> > for (size_t i = 0; i < sizeof A / sizeof A[0]; ++i) {
> >    A[i] = something_nice;
> > } 
> 
> Those loops do not mean the same thing. The first is bogus; the second one is
> OK (notice, the bogus loop has a "41", the OK loop doesn't).
> 
> I'm not surprised you didn't notice how bogus the first loop was - most people
> wouldn't notice it either. And it's Gustedt's main point! I don't know why he
> went off the rails with that overly-clever code, but he did.

 The rest of the discussion aside, what exactly is bogus with the first 
loop?

 AFAICT if index 41 is within the bounds of A, it fills elements [0..41] 
with something_nice and otherwise it does nothing.  I am not quite sure 
offhand what the purpose of such code would be, but otherwise I find it 
pretty straightforward.  Presetting i to (sizeof A / sizeof A[0] - 1) or 
another calculated value taking the size of the array into account would 
be more common.

 Have I missed anything?

  Maciej


* Re: size_t vs long.
  2022-11-18  2:11         ` size_t vs long Maciej W. Rozycki
@ 2022-11-18  2:47           ` Paul Eggert
  2022-11-23 20:01             ` Alejandro Colomar via Libc-alpha
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Eggert @ 2022-11-18  2:47 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Alejandro Colomar, A, libc-alpha

On 11/17/22 18:11, Maciej W. Rozycki wrote:

>>> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>>>     A[i] = something_nice;
>>> }

>  ... what exactly is bogus with the first loop?
> 
>   AFAICT if index 41 is within the bounds of A, it fills elements [0..41]
> with something_nice and otherwise it does nothing.

Yes, and that's precisely what is bogus about it. Most people who read 
that code won't easily see that you're summarizing it correctly 
(assuming INT_MAX < SIZE_MAX). And if I saw that code in a real program, 
my first guess - and it most likely would be the correct guess - is that 
the *author* of the code didn't know what it does; it's so confusingly 
written.

Certainly Alejandro was confused by that bogus loop, as his most recent 
email said the following:

> For code that means exactly the same as
> 
> for (size_t i = 41; i < nitems(A); --i) {
>     A[i] = something_nice;
> }
> 
> 
> we need:
> 
> 
> for (size_t i = 0; i < nitems(A) && i <= 41; ++i) {
>     A[i] = something_nice;
> }
> 
> or
> 
> for (idx_t i = 0; i < nitems(A) && i <= 41; ++i) {
>     A[i] = something_nice;
> }
> 
> or
> 
> for (idx_t i = 41; i < nitems(A) && i >= 0; --i) {
>     A[i] = something_nice;
> }
> 
> (always assuming SIZE_MAX > INT_MAX.) 

and if we call those four loops A, B, C and D, then Alejandro was 
incorrect about B and C because they are not equivalent to A. And 
although D is equivalent to A, D is still confusing and is still Bad Code.

Code like this should be written more the way you said it. E.g.:

    idx_t nice_count = 42;
    if (nice_count <= nitems(A))
      for (idx_t i = 0; i < nice_count; i++)
        A[i] = something_nice;

This is much easier to understand than the other alternatives given so 
far, for reasons that I hope are obvious. And as a bonus, it doesn't 
assume INT_MAX < SIZE_MAX.


* Re: size_t vs long.
  2022-11-18  2:47           ` Paul Eggert
@ 2022-11-23 20:01             ` Alejandro Colomar via Libc-alpha
  0 siblings, 0 replies; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-23 20:01 UTC (permalink / raw)
  To: Paul Eggert; +Cc: A, libc-alpha, Maciej W. Rozycki



Hi Paul,

On 11/18/22 03:47, Paul Eggert wrote:
> On 11/17/22 18:11, Maciej W. Rozycki wrote:
> 
>>>> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>>>>     A[i] = something_nice;
>>>> }
> 
>>  ... what exactly is bogus with the first loop?
>>
>>   AFAICT if index 41 is within the bounds of A, it fills elements [0..41]
>> with something_nice and otherwise it does nothing.
> 
> Yes, and that's precisely what is bogus about it. Most people who read that code 
> won't easily see that you're summarizing it correctly (assuming INT_MAX < 
> SIZE_MAX). And if I saw that code in a real program, my first guess - and it 
> most likely would be the correct guess - is that the *author* of the code didn't 
> know what it does; it's so confusingly written.
> 
> Certainly Alejandro was confused by that bogus loop, as his most recent email 
> said the following:

I made a mistake.  As you said, I didn't address that while reversing the loop :/

> 
>> For code that means exactly the same as
>>
>> for (size_t i = 41; i < nitems(A); --i) {
>>     A[i] = something_nice;
>> }
>>
>>
>> we need:
>>
>>
>> for (size_t i = 0; i < nitems(A) && i <= 41; ++i) {
>>     A[i] = something_nice;
>> }
>>
>> or
>>
>> for (idx_t i = 0; i < nitems(A) && i <= 41; ++i) {
>>     A[i] = something_nice;
>> }
>>
>> or
>>
>> for (idx_t i = 41; i < nitems(A) && i >= 0; --i) {
>>     A[i] = something_nice;
>> }
>>
>> (always assuming SIZE_MAX > INT_MAX.) 
> 
> and if we call those four loops A, B, C and D, then Alejandro was incorrect 
> about B and C because they are not equivalent to A. And although D is equivalent 
> to A, D is still confusing and is still Bad Code.
> 
> Code like this should be written more the way you said it. E.g.:
> 
>     idx_t nice_count = 42;
>     if (nice_count <= nitems(A))
>       for (idx_t i = 0; i < nice_count; i++)
>         A[i] = something_nice;
> 
> This is much easier to understand than the other alternatives given so far, for 
> reasons that I hope are obvious. And as a bonus, it doesn't assume INT_MAX < 
> SIZE_MAX.

However, I don't think that can be attributed to unsigned indices, since the 
signed version of the --i loop also doesn't need the if (and I got it right).  I 
rather attribute it to the complexity of for loops where the entry point is not 
an extreme (and also to myself, who should have been more careful).

Also, your version using a limiting 42 instead of 41 changes the meaning 
slightly: the one with 41 means "I want to start at index 41", while the one 
with 42 means "I want to do something for the first 42 elements".

But let's agree to disagree :)

Cheers,

Alex


-- 
<http://www.alejandro-colomar.es/>



* Using size_t to crash on off-by-one errors (was: size_t vs long.)
  2022-11-17 23:04         ` Alejandro Colomar via Libc-alpha
@ 2022-11-23 20:08           ` Alejandro Colomar via Libc-alpha
  0 siblings, 0 replies; 15+ messages in thread
From: Alejandro Colomar via Libc-alpha @ 2022-11-23 20:08 UTC (permalink / raw)
  To: Paul Eggert, libc-alpha; +Cc: gcc, A



Hi,

On 11/18/22 00:04, Alejandro Colomar wrote:
>>> The main advantage of this code compared to the equivalent ssize_t or 
>>> ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and 
>>> manage to access the array at [-1], if i is unsigned you'll access 
>>> [SIZE_MAX], which will definitely crash your program.
>>
>> That's not true on the vast majority of today's platforms, which don't have 
>> subscript checking, and for which a[-1] is treated the same way a[SIZE_MAX] 
>> is. On my platform (Fedora 36 x86-64) the same machine code is generated for 
>> 'a' and 'b' for the following C code.
>>
>>    #include <stdint.h>
>>    int a(int *p) { return p[-1]; }
>>    int b(int *p) { return p[SIZE_MAX]; }
> 
> Hmm, this seems to be true on my platform (amd64) per the experiment I just did:
> 
> $ cat s.c
> #include <sys/types.h>
> 
> char
> f(char *p, ssize_t i)
> {
>      return p[i];
> }
> $ cat u.c
> #include <stddef.h>
> 
> char
> f(char *p, size_t i)
> {
>      return p[i];
> }
> $ cc -Wall -Wextra -Werror -S -O3 s.c u.c
> $ diff -u u.s s.s
> --- u.s    2022-11-17 23:41:47.773805041 +0100
> +++ s.s    2022-11-17 23:41:47.761805265 +0100
> @@ -1,15 +1,15 @@
> -    .file    "u.c"
> +    .file    "s.c"
>       .text
>       .p2align 4
>       .globl    f
>       .type    f, @function
>   f:
> -.LFB0:
> +.LFB6:
>       .cfi_startproc
>       movzbl    (%rdi,%rsi), %eax
>       ret
>       .cfi_endproc
> -.LFE0:
> +.LFE6:
>       .size    f, .-f
>       .ident    "GCC: (Debian 12.2.0-9) 12.2.0"
>       .section    .note.GNU-stack,"",@progbits
> 
> 
> It seems like a violation of the standard, doesn't it?
> 
> The operator [] doesn't have a type, and an argument to it should be treated 
> with whatever type it has after default promotions.  If I pass a size_t to it, 
> the type should be unsigned, and that should be preserved, by accessing the 
> array at a high value, which the compiler has no way of knowing exists or not 
> from that function definition alone.  The extreme of -1 and SIZE_MAX might not 
> be the best example, since we would need a pointer to be 0 to be accessible at 
> [SIZE_MAX], but if you replace those by -RANDOM and (size_t)-RANDOM, then the 
> compiler definitely needs to generate different code, yet it doesn't.
> 
> I'm guessing this is an optimization by GCC knowing that we will never be close 
> to using the whole 64-bit address space.  If we use int and unsigned, things 
> change:
> 
> $ cat s.c
> char
> f(char *p, int i)
> {
>      return p[i];
> }
> $ cat u.c
> char
> f(char *p, unsigned i)
> {
>      return p[i];
> }
> $ cc -Wall -Wextra -Werror -S -O3 s.c u.c
> $ diff -u u.s s.s
> --- u.s    2022-11-17 23:44:54.446318186 +0100
> +++ s.s    2022-11-17 23:44:54.434318409 +0100
> @@ -1,4 +1,4 @@
> -    .file    "u.c"
> +    .file    "s.c"
>       .text
>       .p2align 4
>       .globl    f
> @@ -6,7 +6,7 @@
>   f:
>   .LFB0:
>       .cfi_startproc
> -    movl    %esi, %esi
> +    movslq    %esi, %rsi
>       movzbl    (%rdi,%rsi), %eax
>       ret
>       .cfi_endproc
> 
> 
> I'm guessing that GCC doesn't make the assumption here, and I guess the unsigned 
> version would crash, while the signed version would cause nasal demons.  Anyway, 
> now that I'm here, I'll test it:
> 
> 
> $ cat s.c
> [[gnu::noipa]]
> char
> f(char *p, int i)
> {
>      return p[i];
> }
> 
> int main(void)
> {
>      int i = -1;
>      char c[4];
> 
>      return f(c, i);
> }
> $ cc -Wall -Wextra -Werror -O3 s.c
> $ ./a.out
> $ echo $?
> 0
> 
> 
> $ cat u.c
> [[gnu::noipa]]
> char
> f(char *p, unsigned i)
> {
>      return p[i];
> }
> 
> int main(void)
> {
>      unsigned i = -1;
>      char c[4];
> 
>      return f(c, i);
> }
> $ cc -Wall -Wextra -Werror -O3 u.c
> $ ./a.out
> Segmentation fault
> 
> 
> I get this SEGV difference consistently.  I CCed gcc@ in case they consider this 
> to be something they want to address.  Maybe the optimization is important for 
> size_t-sized indices, but if it is not, I'd prefer getting the SEGV for SIZE_MAX.
> 

After some thought, of course the compiler can't produce any different code, 
since pointers are 64 bits.  It would be a different story if pointers were 128 
bits, but that might cause its own issues: should sizes still be 64 bits, or 128 
bits?  Maybe a configurable size_t would be interesting for debugging.

Anyway, it's good to know that tweaking size_t to be 32 bits in some debug 
builds might help catch some off-by-one errors.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>



Thread overview: 15+ messages
2022-11-17  7:02 size_t vs long A via Libc-alpha
2022-11-17  9:21 ` Alejandro Colomar via Libc-alpha
2022-11-17  9:48   ` A via Libc-alpha
2022-11-17 11:00     ` Alejandro Colomar via Libc-alpha
2022-11-17 19:40       ` Jason Duerstock via Libc-alpha
2022-11-17 20:01         ` Alejandro Colomar via Libc-alpha
2022-11-17 19:17   ` Paul Eggert
2022-11-17 20:27     ` Alejandro Colomar via Libc-alpha
2022-11-17 21:39       ` Paul Eggert
2022-11-17 23:04         ` Alejandro Colomar via Libc-alpha
2022-11-23 20:08           ` Using size_t to crash on off-by-one errors (was: size_t vs long.) Alejandro Colomar via Libc-alpha
2022-11-18  2:11         ` size_t vs long Maciej W. Rozycki
2022-11-18  2:47           ` Paul Eggert
2022-11-23 20:01             ` Alejandro Colomar via Libc-alpha
2022-11-17 21:58 ` DJ Delorie via Libc-alpha
