In my experience, the worst part of the C standard library is not its existence, but the fact that so many developers insist on slavishly using it directly, instead of safer wrappers.
for(int i = 0; i < len(characters); i++)
{
    if(characters[i]-48 <= 9 && characters[i]-48 >= 0)
    {
        ret = ret * 10 + characters[i] - 48;
    }
    else
    {
        return ERROR;
    }
}
return ret;
Adjust until it actually works, but you get the picture.
The author admits you can parse signed integers in their second example, but for unsigned, they don't seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it as an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.
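For what it's worth, the three complaints listed above (negative input wrapping, trailing garbage accepted, the maximum value doubling as an error sentinel) can all be handled by a thin wrapper. This is only a sketch: `parse_u64_strict` is a made-up name, and it assumes you want to reject leading whitespace and a leading '+' as well.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* parse_u64_strict is a hypothetical helper, not a standard function. */
bool parse_u64_strict(const char *s, unsigned long long *out) {
    assert(s != NULL && out != NULL);
    /* Insist on a leading digit: this rules out whitespace, '+', the empty
       string, and the '-' that strtoull would otherwise accept and wrap. */
    if (!isdigit((unsigned char)s[0])) return false;
    errno = 0;
    char *end = NULL;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno == ERANGE) return false;  /* value did not fit */
    if (*end != '\0') return false;     /* trailing non-numeric data */
    *out = v;
    return true;
}
```

Checking errno rather than the return value is what lets the largest representable value through as a legitimate result.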
I'm not sure what they mean by "output raw" vs "output"
$ cat t.c
#include <stdlib.h>
#include <math.h>
#include <stdio.h>
int main(int argc, char **argv){
    char * enda = NULL;
    unsigned long long a = strtoull("-18446744073709551614", &enda, 10);
    printf("in = -18446744073709551614, out = %llu\n", a);
    char * endb = NULL;
    unsigned long long b = strtoull("-18446744073709551615", &endb, 10);
    printf("in = -18446744073709551615, out = %llu\n", b);
    return 0;
}
$ gcc t.c
$ ./a.out
in = -18446744073709551614, out = 2
in = -18446744073709551615, out = 1
$
I get their "output raw" value. I don't know what their "output" value is coming from.
I don't see anywhere they describe what they are representing in the raw vs not columns.
That's right. I don't like asking it to parse the number contained inside a string, and getting a different number as a result.
That's just simply not the right answer.
> I'm not sure what they mean by "output raw" vs "output"
I can see how that's very unclear. Changed now to "Readable".
As you can read at https://en.wikipedia.org/wiki/Errno.h errno is barely used by the C standard (though defined there). It is rather POSIX that uses errno very encompassingly. For example the WinAPI functions use a much more sensible way to report errors (and don't make use of errno).
if(characters[i] <= '9' && characters[i] >= '0')
{
    ret = ret * 10 + characters[i] - '0';
}
EDIT: perhaps I should have been clearer; by not having one early on, we now have multiple competing package managers, with no clear winner. Responses prove that point.
Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.
String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.
C is not the C standard library, ffs.
The distinction between a language and its standard library gets blurry even in theory, and in practice they're nearly inseparable. If a language's standard library has four ways of doing almost the same thing, and they're all fundamentally broken, that's a problem.
Complete BS in my opinion.
Bonus points for having bespoke linting rules to point out the use of known “bad” functions.
In one old project we went through and replaced all instances of sprintf() with snprintf() or equivalent. Once we were happy that we’d got every occurrence we could then add lint rules to flag up any new use of sprintf() so that devs didn’t introduce new possible problems into the code.
(Obviously you can still introduce plenty of problems with snprintf() but we learned to give that more scrutiny.)
There is a hashmap implementation though: https://man7.org/linux/man-pages/man3/hsearch.3.html
(In fact, looking at it again, I assume I'd purposely purged it from my memory given how terrible it is.)
The non-extensible nature is the biggest one. There are plenty of times when the maximum number of elements needed to be stored will not be known in advance. (See the note about hcreate().)
Secondly the hsearch() implementation requires the keys to be NUL terminated strings since "the same key" is determined using strcmp(). Good luck if you want to use a number, pointer, arbitrary structure or anything else as a key.
Any reasonable hash table implementation would not have either of these limitations.
Maybe I needed to say:
> > like lists/hashmaps/etc which neither C nor the standard libraries provide
... reasonable implementations of.
Similar to how strlcpy() is not a slam dunk fix to the strcpy() problem.
If someone uses sprintf() you have to go faffing around to check whether they've thought about the destination buffer size. The size of the structure may be buried far away through several layers of other APIs/etc.
Using snprintf() doesn't solve this in any way, but checking whether the new use of snprintf() checks the return value is relatively simple. Again, there's still no guarantee that there aren't other problems with snprintf() but, in our experience, we found that once people were forced to use it over sprintf() and had things checked in PR reviews we found that the number of instances of misuse dropped dramatically.
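The return-value check that reviewers would look for is small. A minimal sketch, with `format_greeting` as a made-up example function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* format_greeting is a hypothetical example, not code from the project. */
bool format_greeting(char *dst, size_t dstsize, const char *name) {
    assert(dst != NULL && name != NULL);
    int n = snprintf(dst, dstsize, "hello, %s", name);
    /* n < 0 means an encoding error; n >= dstsize means the output was
       truncated, because n is the length that would have been written. */
    return n >= 0 && (size_t)n < dstsize;
}
```

The point in review is just that the result of snprintf() is compared against the size that was passed in, not ignored.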
It wasn't the switch of functions that reduced the number of problems we saw, but the outright banning of the known footgun `sprintf()` and the careful auditing and replacement of it with `snprintf()` that served as a whole load of reference copies for how to use it. We spread the work of replacing `sprintf()` around the team so that everyone got to do some of the switches and everyone got to review the changes. And we found a whole load of possible problems (most of which were very unlikely to ever lead to a crash or corruption.)
The same would apply if you picked any other known footgun and did similar refactoring/rewrites/auditing/etc.
Anyway, I haven't done C commercially/professionally for about 5 years now. I do miss it though.
The same code can be compiled for different platforms, yes, but the assembly and machine code will vary significantly, so it could behave differently. Porting to a new platform was usually a very complex process, but the code produced was efficient. Nobody seems to care about this nowadays, though.
People expect numbers to support specific ranges and it is fine to define the data types numerically rather than as a concrete bit pattern, but C just takes the cake.
Char is at least 8 bits, short is at least 16 bits, int is at least as big as short (genius idea), long is at least 32 bits, long long is at least 64 bits.
The point of "int" is to be the integer equivalent of size_t and therefore be of word size.
But nobody uses int like that. Everyone assumes it's a 32 bit datatype when it isn't.
The use case where you port existing C code to a microcontroller is extremely unappealing, because the number range gets changed under your feet. When I've had to work on embedded software everyone just used int8_t, int16_t, int32_t, int64_t for portability instead.
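A small illustration of why the fixed-width types help when porting. This is a hypothetical `sum_samples`, assuming 16-bit samples whose total is known to fit in 32 bits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* sum_samples is a hypothetical example. With a plain `int` accumulator,
   30000 + 30000 would already overflow on a target where int is 16 bits;
   int32_t keeps the same range on every platform that provides it. */
int32_t sum_samples(const int16_t *samples, size_t count) {
    assert(samples != NULL || count == 0);
    int32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += samples[i];
    }
    return total;
}
```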
The criticisms related to UB are not about understanding the target platform and the target compiler's behavior. Undefined Behavior is not the same thing as Implementation-defined Behavior, and lots of folks (including me) would be satisfied with reclassifying chunks of UB as the latter.
The behavior of the target platform isn't really the issue. C23 mandates two's complement for signed integers. Most hardware wraps on overflow, but that literally doesn't matter. The standard says a program exhibiting signed overflow is undefined, period.
In practice, UB rules mean the compiler is free to remove checks for signed overflow/underflow, checks for null pointers, etc. This can and does happen. Man, just a few weeks ago, I just had to deal with a crash in a C program that turned out to be due to the compiler removing a null check. That was a painful one.
The what now? Though not lately, I did program in C for 15 years and never saw something like this. I did see some compiler bugs on obscure platforms (SINIX, IRIX, HPUX on Itanium64, etc.) with proprietary compilers, but this kind of thing would really get me shouting.
Were you able to determine why the compiler did this? Is it a bug in the compiler?
Compilers keep taking more and more advantage of inferring that a value in a variable cannot be `x`, because if it were, then some previous usage would have been UB. When people file bugs to complain, the compiler authors point at the spec, which allows them to assume that UB never happens, so the compiler behavior is legal. The only counterargument is if the compiler has chosen to document some specific behavior for this UB (possibly only with specific flags enabled), in which case the compiler treating that scenario as proof of impossibility is indeed a bug (when the required flags are set).
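The usual shape of the removed-null-check problem, as a sketch. The struct and function names here are invented for illustration, and the risky ordering is shown only in a comment so the block itself has no UB:

```c
#include <assert.h>
#include <stddef.h>

struct packet { int length; };

/* Risky ordering (do not write this):
 *
 *     int len = p->length;        // dereference happens first
 *     if (p == NULL) return -1;   // compiler may infer p != NULL and drop this
 *
 * Safe ordering: test the pointer before anything dereferences it. */
int packet_length(const struct packet *p) {
    if (p == NULL) return -1;
    assert(p->length >= 0);  /* invariant assumed by this sketch */
    return p->length;
}
```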
Like… edge cases? It's parsing a number! We're not talking about I/O on hard vs soft intr NFS mounts, here. There's a right answer.
strlen(), on valid null terminated strings, doesn't come with caveats like "oh we can't measure strings of length 99".
But sure, C is turing complete. It is possible to solve any problem a turing machine can solve.
> understand the target platform and the target compiler’s behavior.
This is neither. This is purely the language.
So, when you say, "it's purely the language", I have to disagree. The language means different things on different platforms but it's still defined exactly on the target platform. And it's efficient on that platform.
Nowadays, we prefer correct vs. efficient, which I do agree with, of course. But, I also understand why C is like it is. It is possible to claim it's a problem of the language but I would argue that it is not. C gives us barebones and working with it we have to know this. If that's not needed then sure, other languages will be easier to work with.
The C standard defines only its abstract machine, not actual hardware.
> The language means different things on different platforms but it's still defined exactly on the target platform
It's implemented to support a target platform, so that programs behave as if they ran on the abstract machine.
It'd be nice if we could move more stuff from UB to implementation defined.
Do keep in mind that target platform can change, in this regard. E.g. IIRC OpenBSD doesn't guarantee the ABI backward compatibility that Linux does, and can change things like size of int if they want, between versions.
> I also understand why C is like it is
Yup. It can be true that I understand why, and still understand that it's 2026.
Ugly (and not performant if in a hot path) but it works.
Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537
Interesting! It boils down to this:
pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
    use self::IntErrorKind::*;
    use self::ParseIntError as PIE;
    // guard: radix must be 2..=36
    if 2 > radix || radix > 36 {
        from_ascii_radix_panic(radix);
    }
    if src.is_empty() {
        return Err(PIE { kind: Empty });
    }
    // Strip leading '+' or '-', detect sign
    // (a bare '+' or '-' with nothing after it is an error)
    // accumulate digits, checking for overflow
    Ok(result)
}
But it's not hard at all. It's not even so full of small issues that you can't handle the load, the way dates are. It's just annoying as hell.
The problem is exclusive to C and C++. It's created by the several rounds of standardization of broken behavior.
Ok, having a method to do that for you would be nice, but the post reads like it's an issue that std library doesn't provide you with a method behaving as you exactly want
> The string may begin with an arbitrary amount of whitespace (as determined by isspace(3))
Second is that it only applies to signed long long, not unsigned.
:)
That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.
And octal is more convenient for output via 7-segment LEDs and for input via numeric keypads.
"You have to get lucky every time. We only have to get lucky once".
Precision and exactitude and formally proven correct software can exist in some problem domains, and it's kind of silly to not achieve that when it's achievable.
Fix your inputs.
We agree that the program should exit early. I think we agree it should do it cleanly and intentionally. I'm adding the constraint that "crash" doesn't necessarily mean "cleanly and intentionally", especially when talking about a C program.
I.e. either intentionally (e.g. tripping an assertion failure), or accidentally due to some logic failure in exception/error handling, the process ends up calling the _exit(2) syscall without first having run the libc atexit() handlers that a clean exit(3) would run; or, at a slightly higher runtime abstraction level, the process calls exit(3) or returns from main() without having run through the appropriate RAII destructors (in C++/Rust), or without gracefully signalling will-shutdown to managed threads to allow them to run terminating-state code (in Java/Go/Erlang/Win32/etc).
This kind of "hard abort" often truncates logging output at the point of abort; leaves TCP connections hanging open; leaves lockfiles around on disk; and has the potential to corrupt any data files that were being written to. Basically, it results in the process not executing "should always execute" code to clean up after itself.
So, although the OS kernel/scheduler thinks everything went fine, and that it didn't have to step in to forcibly terminate the process's lifecycle (though it did very likely observe a nonzero process exit code), I think most people would still generally call this type of abort a "crash." The process's runtime got into an invalid/broken state and stopped cleaning up, even if the process itself didn't violate any protection rules / resource limits / etc.
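To make the clean-exit versus hard-abort distinction concrete in C terms, a minimal sketch (`install_cleanup` and `release_lockfile` are hypothetical names):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int cleanup_registered = 0;

/* Stand-in for "should always execute" code such as removing a lockfile. */
static void release_lockfile(void) {
    fputs("cleanup ran\n", stderr);
}

/* Registers the handler once; returns 1 on success, 0 on failure. */
int install_cleanup(void) {
    assert(!cleanup_registered);
    if (atexit(release_lockfile) != 0) return 0;
    cleanup_registered = 1;
    return 1;
}
```

The handler runs on exit() or a return from main(). It does not run if the process goes down via _exit(), abort(), or a fatal signal, which is the case being called a crash here.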
I think that that's by far the dominant usage of crash. It would surprise me if someone used the word crash but intended to exclude panics, etc.
Returning an error on inputs that are too long (for some definition of it) is the way to go.
This is not
./program first_number second_number
Now that you mention it, if the assignment had called for arguments, instead of files or pipes, argv points to a writable array, so the result could be written directly to it, negating any need to allocate memory, and any out-of-memory conditions from large input data would occur before the program is even called.
If it usually uses a file to store the numbers, the same could be done by writing the result back to the file, but that only works if it is passed as an argument, as piping it would throw a seek error. I wonder if the instructor would accept an interleaved little-endian input syntax, with a little-endian output; then the program could use pipes without a need to seek. An infinite series of '9' characters would output an '8', followed by one '9' per two input characters.
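A sketch of that interleaved little-endian idea, under my own assumptions about the format: digits alternate a0 b0 a1 b1 ..., least significant first, and both operands have the same number of digits. `add_interleaved` is a made-up name, and it works on a string rather than a pipe only so it is easy to try; the only state carried between digit pairs is the carry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Writes the little-endian sum as a NUL-terminated
   string into out. Returns the number of digits written, or -1 on
   malformed input or if out is too small. */
long add_interleaved(const char *in, char *out, size_t outcap) {
    assert(in != NULL && out != NULL);
    size_t n = 0;
    int carry = 0;
    for (size_t i = 0; in[i] != '\0'; i += 2) {
        char a = in[i], b = in[i + 1];  /* b is '\0' if the length is odd */
        if (a < '0' || a > '9' || b < '0' || b > '9') return -1;
        int sum = (a - '0') + (b - '0') + carry;
        if (n + 1 >= outcap) return -1;  /* need room for digit and NUL */
        out[n++] = (char)('0' + sum % 10);
        carry = sum / 10;
    }
    if (carry) {
        if (n + 1 >= outcap) return -1;
        out[n++] = '1';
    }
    if (n >= outcap) return -1;
    out[n] = '\0';
    return (long)n;
}
```

Feeding it "9999" (that is, 99 + 99) gives "891", i.e. 198 read least significant digit first: an '8', then one '9' per further pair of input characters, then the final carry once the input ends.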
int stdin_atoi() {
    int i = 0;
    while (1) {
        int c = getchar();
        if (c >= '0' && c <= '9') {
            i = i * 10 + (c - '0');
        } else { break; }
    }
    return i;
}
Would make an excellent “interview question from Hell”!
For strtoul and friends, maybe? 7.24.1 is pretty dense, but the key parts are "the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign […] If the correct value is outside the range of representable values […] ULONG_MAX […] is returned".
So the "expected form" allows a minus sign, but then it's clearly "outside the range of representable values" for strtoul to try parsing a negative value. So maybe it should return ULONG_MAX on those.
So arguably a minus sign present could already be treated as an error, and still be standard compliant. Unless I'm misreading.
It’s more fun when the result can be signed though. Maybe strcmp with the representation of the LONG_MAX, and if it doesn’t match, call strtol and watch for a LONG_MAX indicating an error.
C is a bit messy. Would be nicer to return a struct with a possible error and the desired value, Golang style.
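Something along those lines is easy enough to sketch on top of strtol. The names `long_result` and `parse_long` are invented here, and the sketch keeps strtol's acceptance of leading whitespace and a sign:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* long_result and parse_long are hypothetical names. */
struct long_result {
    long value;
    int err;  /* 0 on success, otherwise an errno-style code */
};

struct long_result parse_long(const char *s) {
    assert(s != NULL);
    struct long_result r = { 0, 0 };
    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0') {
        r.err = EINVAL;   /* no digits at all, or trailing junk */
    } else if (errno == ERANGE) {
        r.err = ERANGE;   /* out of range; v is LONG_MAX or LONG_MIN */
    } else {
        r.value = v;
    }
    return r;
}
```

Because the error travels in its own field, an input that really is LONG_MAX comes back with err == 0 and no string comparison is needed.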
So catch 22. You can only check for valid numbers if the number is valid?
#include <stdio.h>
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: require one numeric argument");
    }
    char *nump = argv[1];
    unsigned neg = 0;
    unsigned long long ures = 0;
    if (*nump == '-') {
        neg = 1;
        nump = nump + 1;
    }
    if (!*nump) {
        fprintf(stderr, "require non empty string\n");
        return 1;
    }
    char b;
    while (b = *nump++) {
        if (b >= '0' && b <= '9') {
            unsigned long long nres = (ures * 10) + (b - '0');
            if (nres < ures) {
                fprintf(stderr, "overflow in '%s'\n", argv[1]);
                return 1;
            }
            ures = nres;
        } else {
            if (b >= ' ') {
                fprintf(stderr, "invalid char '%c' in '%s'\n", b, argv[1]);
            } else {
                fprintf(stderr, "invalid byte '%d' in '%s'\n", b, argv[1]);
            }
            return 1;
        }
    }
    long long res = (long long) ures;
    if (neg) {
        if (ures <= 0x8000000000000000ULL) {
            res = -res;
        } else {
            fprintf(stderr, "underflow in '%s'\n", argv[1]);
            return 1;
        }
    } else if (ures > 0x7FFFFFFFFFFFFFFFULL) {
        fprintf(stderr, "overflow in '%s'\n", argv[1]);
        return 1;
    }
    fprintf(stdout, "result: %lld\n", res);
    return 0;
}
$ clang parseint.c -fsanitize=undefined -O0 -g -o parseint
$ ./parseint -9223372036854775808
parseint.c:38:23: runtime error: negation of -9223372036854775808 cannot be represented in type 'long long'; cast to an unsigned type to negate this value to itself
result: -9223372036854775808
edit: this is just to show that getting undefined behavior right is hard!
In every language, the standard library makes some assumptions about this. In JavaScript, Number("") evaluates to zero.
The standard C library, which dates back to the stone age, does the simplest thing you can do without range checking, because, well, that's kinda the C paradigm. If you want parsing that handles edge cases in a specific way, you do it yourself. It's just digits.
No, but there are a myriad of incorrect ways and the C library's way is one of them.
It's perfectly fine to make reasonable choices for all those options and then implement them correctly.
Perhaps the right title should be "No way to parse pathological edge cases in 'C'"
And then see how other languages do.
None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.
These functions are all designed to avoid making any assumptions about the syntax of the larger document the numeric lexeme might be embedded in. You might, after all, be using a syntax where numbers can come with units on the end. Or you might be reading numbers as comma-separated values.
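The end pointer is what makes that style of use work: the function extracts one numeric lexeme and the caller's grammar decides what may follow it. A sketch for the comma-separated case, with `sum_csv` as a hypothetical name and overflow checks left out for brevity:

```c
#include <assert.h>
#include <stdlib.h>

/* sum_csv is a hypothetical example. Returns how many values were read,
   or -1 if the text is not a comma-separated list of integers.
   Range checking of each value and of the sum is omitted in this sketch. */
int sum_csv(const char *s, long *total) {
    assert(s != NULL && total != NULL);
    long sum = 0;
    int count = 0;
    for (;;) {
        char *end = NULL;
        long v = strtol(s, &end, 10);  /* scans one lexeme, stops after it */
        if (end == s) return -1;       /* no number where one was expected */
        sum += v;
        count++;
        if (*end == '\0') break;
        if (*end != ',') return -1;    /* our grammar: only ',' may follow */
        s = end + 1;
    }
    *total = sum;
    return count;
}
```

Swap the ',' test for a unit suffix check and the same loop handles input like "12px", which is the flexibility being described.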
And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:
- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),
- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;
- and nor (especially in early memory-constrained systems) are you able to perform allocations of heap memory in order to employ an unbounded growable buffer for retaining the current lexeme until you do reach the end of it (which, if you could, would let you use a scanner state-machine that doubles as a parser/validator, returning either a parsed value or an error)
- but instead, to deal with the 1. unbounded input, 2. of textual encoding, 3. in constant memory, you must eagerly scan the input stream (i.e. synchronously reduce over each received byte, or at most each fixed-length N-byte chunk using a static or stack-allocated fixed-length buffer, discarding the original string bytes once reduced-over) to produce lexically-decoded (but not parsed/validated) lexemes; and then do this again, on a higher level, feeding your stream of lexemes into a fixed-sized sum-typed ring-buffer (i.e. an array-of-union-typed-lexeme-struct-type-entries), where you can then invoke a function that attempts to scan over + consume them (but unlike the original stream-parsing function, doesn't consume the buffer unless successful, and so isn't functioning as a scanner per se, but rather as an LR parser.)
If you're not writing UNIX CLI tools, direct use of the C-stdlib numeric-lexeme scan functions is operating on the wrong abstraction layer. What you want, if you have pre-framed strings that are "either valid numbers or parse errors", is to implement an actual parsing function... that can then invoke these numeric-lexer functions to do the majority of its work.
And if you're writing C, and yet you're not in UNIX-pipeline unbounded-text-stream land, but rather are parsing well-defined bounded-length "documents" (like, say, C source files)... then you probably want to use a real lexer-generator (like flex) to feed a parser-generator (like yacc/bison). Where:
- you'd validate the token in context, in the parsing phase;
- and your lexing rules would make certain classes of input invalid at lexing time. (E.g. you can write your lexeme matching rules such that multi-digit numbers with leading zeroes, or floating-point values with no digits before/after the decimal place, simply aren't "numbers" from your lexer's perspective.)
...which means that, once again, you can "get away with" invoking the regular C numeric-lexeme scanner functions; i.e. `yylval = atoi(yytext);` in bison terms. (And you'd want to, since doing so saves memory vs. keeping the numbers around as strings.)
For integers, you're faster (in both development time and runtime) to write your own parser than to try and assemble the pieces in this pile of shit into a half-working one.
C++17 from_chars excluded. Incidentally, 2022 seems about right for the year that ONE open source implementation finally actually implemented the float part of that. Or was it more like 2024?
This wouldn't even pass a cursory sanity check of the api from a beginner developer, how did it end up in a standard library at all? Was it a mistake and then it was just too late to remove it?
Any function that can either succeed or fail, which is basically every parsing function, must typically indicate success or failure. You can terminate the program or you can return an object that itself indicates failure (such as -1 when finding a positive index) but if ALL values of the return type CAN be valid then the success state must be a separate return value.
What's the purpose of the function atol() if it doesn't have that? Is it "It's still useful for trusted input we know is a string representation of a long" (e.g. for a bounded number roundtrip)? That seems awfully limited. But perhaps such a scenario was more common in 1960?