unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* RFC: *scanf vs. overflow
@ 2020-05-22 20:59 Eric Blake via Libc-alpha
  2020-05-23  1:16 ` Rich Felker
  2020-05-23  7:06 ` Richard W.M. Jones via Libc-alpha
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Blake via Libc-alpha @ 2020-05-22 20:59 UTC (permalink / raw)
  To: glibc list; +Cc: Florian Weimer, libguestfs@redhat.com

It has long been known that the C specification of *scanf() leaves 
behavior undefined for things like
int i;
sscanf("9999999999999999", "%i", &i);

C11 7.21.6.2 P12
"Matches an optionally signed integer, whose format is the same as 
expected for the subject sequence of the strtol function with the value 
0 for the base argument."
C11 7.21.6.2 P10
"If this object does not have an appropriate type, or if the result of 
the conversion cannot be represented in the object, the behavior is 
undefined."

as there is an overflow when consuming the input which matches the 
strtol subject sequence but does not fit in the width of an int.  On my 
Linux system, 'man sscanf' mentions that ERANGE might be set in such a 
case, but neither C nor POSIX actually requires this behavior; other 
likely behaviors is storing the value mod 2^32 into i, or storing 
INT_MAX into i, or ...

This is annoying - the only safe way to parse integers from 
untrustworthy sources, where overflow MUST be detected, is to manually 
open-code strtol() calls, which can get quite lengthy in comparison to 
the concise representations possible with *scanf.

Would glibc be willing to consider a GNU extension to add an optional 
flag character between '%' and the various numeric conversion specifiers 
(both integral based on strto*l, and floating point based on strtod), 
where we could force *scanf to treat numeric overflow as a matching 
failure, rather than undefined behavior?  Or even a second flag to 
request that printf stop consuming characters if the next character in 
input would cause overflow in the current specifier, leaving that 
character to instead be matched to the remainder of the format string?

Let's suppose for arguments that we add '^' as a request to force 
overflow to be a matching error.  Then sscanf("9999999999999999", "%^i", 
&i) would be well-specified to return 0, rather than returning 1 with an 
unknown value assigned into i or any other behavior that other libc do 
with the undefined behavior when the ^ is not present.

And if glibc likes the idea of such an extension, and we see an uptick 
in applications actually using it, I'd also be happy to champion the 
addition of such an extension in POSIX (but the POSIX folks will 
definitely want to see existing practice first - both an implementation 
and applications that use that implementation).  The libguestfs suite of 
programs is willing to be an early adopter, if glibc is willing to 
pursue adding such a safety valve.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-22 20:59 RFC: *scanf vs. overflow Eric Blake via Libc-alpha
@ 2020-05-23  1:16 ` Rich Felker
  2020-05-23  3:06   ` Paul Eggert
  2020-05-23  7:06 ` Richard W.M. Jones via Libc-alpha
  1 sibling, 1 reply; 11+ messages in thread
From: Rich Felker @ 2020-05-23  1:16 UTC (permalink / raw)
  To: Eric Blake; +Cc: Florian Weimer, glibc list, libguestfs@redhat.com

On Fri, May 22, 2020 at 03:59:14PM -0500, Eric Blake via Libc-alpha wrote:
> It has long been known that the C specification of *scanf() leaves
> behavior undefined for things like
> int i;
> sscanf("9999999999999999", "%i", &i);
> 
> C11 7.21.6.2 P12
> "Matches an optionally signed integer, whose format is the same as
> expected for the subject sequence of the strtol function with the
> value 0 for the base argument."
> C11 7.21.6.2 P10
> "If this object does not have an appropriate type, or if the result
> of the conversion cannot be represented in the object, the behavior
> is undefined."
> 
> as there is an overflow when consuming the input which matches the
> strtol subject sequence but does not fit in the width of an int.  On
> my Linux system, 'man sscanf' mentions that ERANGE might be set in
> such a case, but neither C nor POSIX actually requires this
> behavior; other likely behaviors is storing the value mod 2^32 into
> i, or storing INT_MAX into i, or ...
> 
> This is annoying - the only safe way to parse integers from
> untrustworthy sources, where overflow MUST be detected, is to
> manually open-code strtol() calls, which can get quite lengthy in
> comparison to the concise representations possible with *scanf.
> 
> Would glibc be willing to consider a GNU extension to add an
> optional flag character between '%' and the various numeric
> conversion specifiers (both integral based on strto*l, and floating
> point based on strtod), where we could force *scanf to treat numeric
> overflow as a matching failure, rather than undefined behavior?  Or
> even a second flag to request that printf stop consuming characters
> if the next character in input would cause overflow in the current
> specifier, leaving that character to instead be matched to the
> remainder of the format string?

Since conversion specifier forms outside the standard *also* have
undefined behavior, I see no advantage to defining that particular
undefined case vs just defining the result of the overflowing
conversion, unless you're worried the standard might later define a
conflicting definition. Neither way is amenable to configure detection
(without breaking cross compiling) without also adopting something
like my proposal on libc-coord:
https://www.openwall.com/lists/libc-coord/2020/04/22/1

BTW there is a portable only-somewhat-hideous way to do this with
sscanf: using assignment suppression combined with %n, then strtol,
etc. with the offsets sproduced by %n.

> Let's suppose for arguments that we add '^' as a request to force
> overflow to be a matching error.  Then sscanf("9999999999999999",
> "%^i", &i) would be well-specified to return 0, rather than
> returning 1 with an unknown value assigned into i or any other
> behavior that other libc do with the undefined behavior when the ^
> is not present.
> 
> And if glibc likes the idea of such an extension, and we see an
> uptick in applications actually using it, I'd also be happy to
> champion the addition of such an extension in POSIX (but the POSIX
> folks will definitely want to see existing practice first - both an
> implementation and applications that use that implementation).  The
> libguestfs suite of programs is willing to be an early adopter, if
> glibc is willing to pursue adding such a safety valve.

I think it would be more useful to look for existing practice where
the UB blows up in horrible ways, and if there is none (if all
implementations behave somewhat reasonably) define the intersection of
their behaviors as standard and get rid of the UB here. A new feature
will not reliably be usable for decades in portable software, but new
documentation of existing universal practice would be immediately
usable.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23  1:16 ` Rich Felker
@ 2020-05-23  3:06   ` Paul Eggert
  2020-05-23 16:11     ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Eggert @ 2020-05-23  3:06 UTC (permalink / raw)
  To: Rich Felker, Eric Blake; +Cc: Florian Weimer, glibc list, libguestfs@redhat.com

On 5/22/20 6:16 PM, Rich Felker wrote:
> A new feature
> will not reliably be usable for decades in portable software, but new
> documentation of existing universal practice would be immediately
> usable.

We could do both.

Also, we could change glibc's behavior in a simpler way, by not adding a new
flag; but if an integer is out of range, then scan only the initial prefix that
fits, leaving the trailing digits for the rest of the format to scan. This also
conforms to POSIX and is more likely to cause C programs to do the right thing
(i.e., report a failure) than the current behavior does. And with luck perhaps
we could eventually get POSIX to standardize this behavior.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-22 20:59 RFC: *scanf vs. overflow Eric Blake via Libc-alpha
  2020-05-23  1:16 ` Rich Felker
@ 2020-05-23  7:06 ` Richard W.M. Jones via Libc-alpha
  2020-05-23 15:25   ` Paul Eggert
  2020-05-23 16:21   ` Rich Felker
  1 sibling, 2 replies; 11+ messages in thread
From: Richard W.M. Jones via Libc-alpha @ 2020-05-23  7:06 UTC (permalink / raw)
  To: Eric Blake; +Cc: Florian Weimer, glibc list, libguestfs@redhat.com

The context to this is that nbdkit uses sscanf to parse simple file
formats in various places, eg:

https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172
https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/filters/ddrescue/ddrescue.c#L98

We can only do this safely where we can prove that overflow does not
matter.  In other cases we've had to change sscanf uses to strto* etc
which is much more difficult to use correctly.  Just look at how much
code is required to wrap strto* functions to use them safely:

https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/server/public.c#L113-L296

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23  7:06 ` Richard W.M. Jones via Libc-alpha
@ 2020-05-23 15:25   ` Paul Eggert
  2020-05-23 16:21   ` Rich Felker
  1 sibling, 0 replies; 11+ messages in thread
From: Paul Eggert @ 2020-05-23 15:25 UTC (permalink / raw)
  To: Richard W.M. Jones, Eric Blake
  Cc: Florian Weimer, glibc list, libguestfs@redhat.com

On 5/23/20 12:06 AM, Richard W.M. Jones via Libc-alpha wrote:

> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172

> We can only do this safely where we can prove that overflow does not
> matter.

Yes, this is exactly the sort of usage that I had in mind. In the following
example, which is the first use of *scanf I saw, if scanf never allowed integer
overflow (that is, it scanned only as much of a number that would fit), this
code would output an error message instead of blithely going on with an
overflowed number, and this would be safer than the code's current behavior.

>       if (sscanf (&value[i], "*%" SCNi64 "%n", &k, &n) == 1) {
>         if (k < 0) {
>           nbdkit_error ("data parameter *N must be >= 0");
>           return -1;
>         }
>         ...
>       } else {
>         nbdkit_error ("')' in data string not followed by '*'");
>         return -1;
>       }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23  3:06   ` Paul Eggert
@ 2020-05-23 16:11     ` Rich Felker
  2020-05-23 16:28       ` Paul Eggert
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2020-05-23 16:11 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Florian Weimer, glibc list, Eric Blake, libguestfs@redhat.com

On Fri, May 22, 2020 at 08:06:34PM -0700, Paul Eggert wrote:
> On 5/22/20 6:16 PM, Rich Felker wrote:
> > A new feature
> > will not reliably be usable for decades in portable software, but new
> > documentation of existing universal practice would be immediately
> > usable.
> 
> We could do both.
> 
> Also, we could change glibc's behavior in a simpler way, by not adding a new
> flag; but if an integer is out of range, then scan only the initial prefix that
> fits, leaving the trailing digits for the rest of the format to scan. This also
> conforms to POSIX and is more likely to cause C programs to do the right thing
> (i.e., report a failure) than the current behavior does. And with luck perhaps
> we could eventually get POSIX to standardize this behavior.

I'm not really a fan of stopping on an initial prefix. While UB allows
anything, that's contrary to the abstract behavior defined for scanf
(matching fields syntactically then value conversion) and does not
admit easily sharing a backend with strto*. It's also even *more
likely* to break programs that don't expect the behavior than just
storing a wrapped or clamped value, since all the remaining fields
will misalign with the conversion specifier string.

FILE-based (as opposed to string-based) scanf forms inherently do not
admit any kind of "recovery" after mismatch without the caller seeking
backwards (requiring a seekable stream); many of them are lossy on
error. This is mainly a reaon not to use them, not a justification for
a weird definition for one special case.

I'm pretty sure the real answer here is just "don't use *scanf for
that."

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23  7:06 ` Richard W.M. Jones via Libc-alpha
  2020-05-23 15:25   ` Paul Eggert
@ 2020-05-23 16:21   ` Rich Felker
  1 sibling, 0 replies; 11+ messages in thread
From: Rich Felker @ 2020-05-23 16:21 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Florian Weimer, glibc list, Eric Blake, libguestfs@redhat.com

On Sat, May 23, 2020 at 08:06:54AM +0100, Richard W.M. Jones via Libc-alpha wrote:
> The context to this is that nbdkit uses sscanf to parse simple file
> formats in various places, eg:
> 
> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172
> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/filters/ddrescue/ddrescue.c#L98
> 
> We can only do this safely where we can prove that overflow does not
> matter.

Being that it's specified as UB, it can never "not matter";
arbitrarily bad side effects are permitted. So it's only safe to use
scanf where the input is *trusted not to contain overflowing values*.

What would be really nice to fix here is getting the standard to
specify that overflow has behavior like strto* or at least
"unspecified value" rather than "undefined behavior" so that it's safe
to let it overflow in cases where you don't care (e.g. you'll be
consistency-checking the value afterwards anyway).

> In other cases we've had to change sscanf uses to strto* etc
> which is much more difficult to use correctly.  Just look at how much
> code is required to wrap strto* functions to use them safely:
> 
> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/server/public.c#L113-L296

Really that much code is just for the sake of verbose error messages,
and they're not even accurate. "errno!=0" does not mean "could not
parse"; it can also be overflow of a perfectly parseable value. And if
you've already caught errno!=0 then end==str is impossible (dead
code). The last case, not hitting null, is also likely spurious/wrong;
you usually *want* to pick up where strto* stopped, and the next thing
the parser does will catch whether the characters after the number are
valid there or not.

strto* do have some annoying design flaws in error reporting, but
they're not really that hard to use right, and much easier than scanf
which just *lacks the reporting channels* for the kind of fine-grained
error reporting you're insisting on doing here.

FWIW this code would also be a lot cleaner as a static inline function
rather than a many-line macro.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23 16:11     ` Rich Felker
@ 2020-05-23 16:28       ` Paul Eggert
  2020-05-23 16:45         ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Eggert @ 2020-05-23 16:28 UTC (permalink / raw)
  To: Rich Felker; +Cc: Florian Weimer, glibc list, Eric Blake, libguestfs@redhat.com

On 5/23/20 9:11 AM, Rich Felker wrote:

> stopping on an initial prefix ... does not admit easily sharing a backend with strto*.

I don't see why. If the backend has a "stop scanning on integer overflow" flag
(which it would need to have anyway, to support the proposed behavior), then
*scanf can use the flag and strto* can not use it.

Anyway, this is not an issue for glibc, which has no such backend.

> that's contrary to the abstract behavior defined for scanf
> (matching fields syntactically then value conversion)

That's not really a problem. The abstract behavior already provides for matching
that is not purely syntactic. For example, string conversion specifiers can
impose length limits on the match, which means the matching does not rely purely
on the syntax of the input. It would be easy to say that integer conversion
specifiers can also impose limits related to integer overflow.

> It's also even *more
> likely* to break programs that don't expect the behavior than just
> storing a wrapped or clamped value

That's not true of the code that I looked at (see the URLs earlier in this
thread). That code was pretty carefully written and yet still vulnerable to the
integer-overflow issue.

> I'm pretty sure the real answer here is just "don't use *scanf for
> that."

Absolutely true right now. We are merely talking about (a) what sort of
implementation behavior is more useful for programs that are currently relying
on undefined behavior, and (b) what might be the cleanest addition to POSIX
later, to help improve this mess so that future programmers can use *scanf
safely in more situations.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23 16:28       ` Paul Eggert
@ 2020-05-23 16:45         ` Rich Felker
  2020-05-23 17:18           ` Paul Eggert
  2020-05-26  9:30           ` [Libguestfs] " Richard W.M. Jones via Libc-alpha
  0 siblings, 2 replies; 11+ messages in thread
From: Rich Felker @ 2020-05-23 16:45 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Florian Weimer, glibc list, Eric Blake, libguestfs@redhat.com

On Sat, May 23, 2020 at 09:28:26AM -0700, Paul Eggert wrote:
> On 5/23/20 9:11 AM, Rich Felker wrote:
> 
> > stopping on an initial prefix ... does not admit easily sharing a backend with strto*.
> 
> I don't see why. If the backend has a "stop scanning on integer overflow" flag
> (which it would need to have anyway, to support the proposed behavior), then
> *scanf can use the flag and strto* can not use it.
> 
> Anyway, this is not an issue for glibc, which has no such backend.

It's relevant because you want to propose this for standardization.

> > that's contrary to the abstract behavior defined for scanf
> > (matching fields syntactically then value conversion)
> 
> That's not really a problem. The abstract behavior already provides for matching
> that is not purely syntactic. For example, string conversion specifiers can
> impose length limits on the match, which means the matching does not rely purely
> on the syntax of the input. It would be easy to say that integer conversion
> specifiers can also impose limits related to integer overflow.

Sure that's syntax. It's /[^ ]{1,n}"/.

Of course for integers you can define a syntax that matches every
non-overflowing value (this is always true for finite matching sets),
but that's nothing like how the function is specified and I don't
think anyone reasonable would classify non-overflow as a syntactic
property.

> > It's also even *more
> > likely* to break programs that don't expect the behavior than just
> > storing a wrapped or clamped value
> 
> That's not true of the code that I looked at (see the URLs earlier in this
> thread). That code was pretty carefully written and yet still vulnerable to the
> integer-overflow issue.

I don't follow. *Any* use of scanf on untrusted input is "vulnerable
to the integer-overflow issue" in the sense that overflow is UB. This
is not something subtle.

If you mean actually using overflowed values in an unsafe way
(assuming no ballooning effects of UB, just wrong values), I don't see
how it's subtle either. Any value that could be produced via overflow
could also be produced via non-overflowing input, and you have to
validate data either way.

> > I'm pretty sure the real answer here is just "don't use *scanf for
> > that."
> 
> Absolutely true right now. We are merely talking about (a) what sort of
> implementation behavior is more useful for programs that are currently relying
> on undefined behavior, and (b) what might be the cleanest addition to POSIX
> later, to help improve this mess so that future programmers can use *scanf
> safely in more situations.

This is absolutely not "clean" and I am opposed to it.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: *scanf vs. overflow
  2020-05-23 16:45         ` Rich Felker
@ 2020-05-23 17:18           ` Paul Eggert
  2020-05-26  9:30           ` [Libguestfs] " Richard W.M. Jones via Libc-alpha
  1 sibling, 0 replies; 11+ messages in thread
From: Paul Eggert @ 2020-05-23 17:18 UTC (permalink / raw)
  To: Rich Felker; +Cc: Florian Weimer, glibc list, Eric Blake, libguestfs@redhat.com

On 5/23/20 9:45 AM, Rich Felker wrote:

> It's relevant because you want to propose this for standardization.

I don't think it's ready for standardization now. I'm merely proposing it for
glibc. If it works well there, great; if not, then glibc should do something
more ambitious, such as Eric's proposal.

Doing nothing is not a good option; this is a real problem that affects many
real programs.

> that's syntax. It's /[^ ]{1,n}"/.

I'll concede that. That being said, the difference between syntax and semantics
is always somewhat arbitrary, and there's little point to slicing and dicing the
fuzzy boundary here. Regardless of whether the change is to "syntax" or
"semantics" it would be easy to change POSIX to allow the proposed behavior;
it's not a fundamental change to the spec.

> *Any* use of scanf on untrusted input is "vulnerable
> to the integer-overflow issue" in the sense that overflow is UB.

Absolutely, but that was not the point. The issue is what's the best thing to do
for these programs. Many carefully-written but imperfect programs have these
bugs, and although glibc is within its rights to dump core or worse when these
programs run, it's better if glibc's behavior improves their overall
reliability. Letting these errors run through silently causing further
corruption later is not a good strategy for improving overall reliability.
Pointing fingers at developers and telling them not to use scanf is not likely
to be a good strategy either.
> Any value that could be produced via overflow
> could also be produced via non-overflowing input, and you have to
> validate data either way.

Sure, but silently treating 2**32 as if it were zero is more likely to cause
real problems later. We need a better way for *scanf to reflect these errors
back to the calling code.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Libguestfs] RFC: *scanf vs. overflow
  2020-05-23 16:45         ` Rich Felker
  2020-05-23 17:18           ` Paul Eggert
@ 2020-05-26  9:30           ` Richard W.M. Jones via Libc-alpha
  1 sibling, 0 replies; 11+ messages in thread
From: Richard W.M. Jones via Libc-alpha @ 2020-05-26  9:30 UTC (permalink / raw)
  To: Rich Felker; +Cc: glibc list, libguestfs@redhat.com

On Sat, May 23, 2020 at 12:45:01PM -0400, Rich Felker wrote:
> I don't follow. *Any* use of scanf on untrusted input is "vulnerable
> to the integer-overflow issue" in the sense that overflow is UB. This
> is not something subtle.

{,s}scanf is a useful, natural way to parse strings, and strto* is a
horrible interface with many bear traps.  It seems to me scanf could
be changed to make it safe for overflow, simply by stopping parsing at
the point where the overflow occurs and returning a short count (or
the various other ideas suggested already in this thread).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-05-26  9:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22 20:59 RFC: *scanf vs. overflow Eric Blake via Libc-alpha
2020-05-23  1:16 ` Rich Felker
2020-05-23  3:06   ` Paul Eggert
2020-05-23 16:11     ` Rich Felker
2020-05-23 16:28       ` Paul Eggert
2020-05-23 16:45         ` Rich Felker
2020-05-23 17:18           ` Paul Eggert
2020-05-26  9:30           ` [Libguestfs] " Richard W.M. Jones via Libc-alpha
2020-05-23  7:06 ` Richard W.M. Jones via Libc-alpha
2020-05-23 15:25   ` Paul Eggert
2020-05-23 16:21   ` Rich Felker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).