Using size_t to crash on off-by-one errors (was: size_t vs long.)

unofficial mirror of libc-alpha@sourceware.org

From: Alejandro Colomar via Libc-alpha <libc-alpha@sourceware.org>
To: Paul Eggert <eggert@cs.ucla.edu>, libc-alpha@sourceware.org
Cc: gcc@gcc.gnu.org, A <amit234234234234@gmail.com>
Subject: Using size_t to crash on off-by-one errors (was: size_t vs long.)
Date: Wed, 23 Nov 2022 21:08:17 +0100
Message-ID: <148dc963-1d9c-b7d8-e5bf-6843b4b36882@gmail.com>
In-Reply-To: <683baaee-f3dc-bc13-c303-8fb0df0d0a36@gmail.com>



Hi,

On 11/18/22 00:04, Alejandro Colomar wrote:
>>> The main advantage of this code compared to the equivalent ssize_t or 
>>> ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and 
>>> manage to access the array at [-1], if i is unsigned you'll access 
>>> [SIZE_MAX], which will definitely crash your program.
>>
>> That's not true on the vast majority of today's platforms, which don't have 
>> subscript checking, and for which a[-1] is treated the same way a[SIZE_MAX] 
>> is. On my platform (Fedora 36 x86-64) the same machine code is generated for 
>> 'a' and 'b' for the following C code.
>>
>>    #include <stdint.h>
>>    int a(int *p) { return p[-1]; }
>>    int b(int *p) { return p[SIZE_MAX]; }
> 
> Hmm, this seems to be true on my platform (amd64), per the experiment I just did:
> 
> $ cat s.c
> #include <sys/types.h>
> 
> char
> f(char *p, ssize_t i)
> {
>      return p[i];
> }
> $ cat u.c
> #include <stddef.h>
> 
> char
> f(char *p, size_t i)
> {
>      return p[i];
> }
> $ cc -Wall -Wextra -Werror -S -O3 s.c u.c
> $ diff -u u.s s.s
> --- u.s    2022-11-17 23:41:47.773805041 +0100
> +++ s.s    2022-11-17 23:41:47.761805265 +0100
> @@ -1,15 +1,15 @@
> -    .file    "u.c"
> +    .file    "s.c"
>       .text
>       .p2align 4
>       .globl    f
>       .type    f, @function
>   f:
> -.LFB0:
> +.LFB6:
>       .cfi_startproc
>       movzbl    (%rdi,%rsi), %eax
>       ret
>       .cfi_endproc
> -.LFE0:
> +.LFE6:
>       .size    f, .-f
>       .ident    "GCC: (Debian 12.2.0-9) 12.2.0"
>       .section    .note.GNU-stack,"",@progbits
> 
> 
> It seems to be a violation of the standard, doesn't it?
> 
> The operator [] doesn't have a type, and its argument should be treated 
> with whatever type it has after default promotions.  If I pass a size_t to it, 
> the type should be unsigned, and that should be preserved by accessing the 
> array at a high index, which the compiler has no way of knowing exists or 
> not from that function definition.  The extremes of -1 and SIZE_MAX might not 
> be the best example, since the pointer would need to be 0 for [SIZE_MAX] to be 
> accessible, but if you replace them with -RANDOM and (size_t)-RANDOM, then the 
> compiler definitely needs to generate different code, yet it doesn't.
> 
> I'm guessing this is an optimization by GCC knowing that we will never be close 
> to using the whole 64-bit address space.  If we use int and unsigned, things 
> change:
> 
> $ cat s.c
> char
> f(char *p, int i)
> {
>      return p[i];
> }
> $ cat u.c
> char
> f(char *p, unsigned i)
> {
>      return p[i];
> }
> $ cc -Wall -Wextra -Werror -S -O3 s.c u.c
> $ diff -u u.s s.s
> --- u.s    2022-11-17 23:44:54.446318186 +0100
> +++ s.s    2022-11-17 23:44:54.434318409 +0100
> @@ -1,4 +1,4 @@
> -    .file    "u.c"
> +    .file    "s.c"
>       .text
>       .p2align 4
>       .globl    f
> @@ -6,7 +6,7 @@
>   f:
>   .LFB0:
>       .cfi_startproc
> -    movl    %esi, %esi
> +    movslq    %esi, %rsi
>       movzbl    (%rdi,%rsi), %eax
>       ret
>       .cfi_endproc
> 
> 
> I'm guessing that GCC doesn't make that assumption here, and I guess the 
> unsigned version would crash, while the signed version would cause nasal 
> demons.  Anyway, now that I'm here, I'll test it:
> 
> 
> $ cat s.c
> [[gnu::noipa]]
> char
> f(char *p, int i)
> {
>      return p[i];
> }
> 
> int main(void)
> {
>      int i = -1;
>      char c[4];
> 
>      return f(c, i);
> }
> $ cc -Wall -Wextra -Werror -O3 s.c
> $ ./a.out
> $ echo $?
> 0
> 
> 
> $ cat u.c
> [[gnu::noipa]]
> char
> f(char *p, unsigned i)
> {
>      return p[i];
> }
> 
> int main(void)
> {
>      unsigned i = -1;
>      char c[4];
> 
>      return f(c, i);
> }
> $ cc -Wall -Wextra -Werror -O3 u.c
> $ ./a.out
> Segmentation fault
> 
> 
> I get this SEGV difference consistently.  I CCed gcc@ in case they consider this 
> to be something they want to address.  Maybe the optimization is important for 
> size_t-sized indices, but if it is not, I'd prefer getting the SEGV for SIZE_MAX.
> 

After some thought, of course the compiler can't produce any different code, 
since pointers are 64 bits: with a 64-bit size_t, p[SIZE_MAX] wraps around to 
the same address as p[-1].  A different story would be if pointers were 128 
bits, but that might cause its own issues: should sizes still be 64 bits, or 
128 bits?  Maybe using a configurable size_t would be interesting for debugging.

Anyway, it's good to know that tweaking size_t to be 32 bits in some debug 
builds might help catch some off-by-one errors.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>

