In my opinion, unwritten rules are for gatekeeping. And if a new person follows all the unwritten rules, magically there's no one willing to review.
I think this is how large BDFL-style open source projects slowly become less and less relevant over the next few decades.
For these projects everything "tribal" has to be explicitly codified.
On a more general note: this is likely going to have a rather big impact on software in general - the "engineer the company cannot afford to lose" is likely losing their moat entirely.
For large works, the burden shifts, since you are increasing the maintenance load. Now we have the question of who will do the future work, and that requires judgement of the importance of the work and/or the author, and hence is a fundamentally political question.
I don't believe there's anybody who can reason about them at code skimming speeds. It's probably the best place to hide underhanded code.
-Wextra catches stuff like this, alas I know of a few people that think "-Wextra is evil" (even though annoying warnings can be selectively disabled)
I still remember being expected to pass -Wpedantic (and probably also -Wextra) in university.
One thing that I am glad to have been taught early on in my career when it comes to debugging, especially anything involving HW, is to "make no assumptions". Bugs can be anywhere and everywhere.
One thing I noticed: The last footnote is missing.
There is absolutely no "sign extension" in the C standard (go ahead, search it). "Sign extension" is a feature of some assembly instructions on some architectures, but C has nothing to do with it.
Citing integer promotion from the standard is justified, but it's just one part (perhaps even the smaller part) of the picture. The crucial bit is not quoted in the article: the specification of "Bitwise shift operators". Namely:
> The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. [...]
> The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1×2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1×2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
What happens here is that "base2" (of type uint8_t, which is "unsigned char" in this environment) gets promoted to "int", and then left-shifted by 24 bits. You get undefined behavior because, while "base2" (after promotion) has a signed type ("int") and nonnegative value, E1×2^E2 (i.e., base2 × 2^24) is NOT representable in the result type ("int").
What happens during the conversion to "uint64_t" afterwards is irrelevant; even the particulars of the sign bit of "int", and how you end up with a negative "int" from the shift, are irrelevant; you got your UB right inside the invalid left-shift. How said UB happens to materialize on this particular C implementation may perhaps be explained in terms of sign extension of the underlying ISA -- but do that separately; be absolutely clear about what is what.
The article fails to mention the root cause (violating the rules for the bitwise left-shift operator) and fails to name the key consequence (undefined behavior); instead, it leads with not-a-thing ("sign-extension bug in C"). I'm displeased.
BTW this bug (invalid left shift of a signed integer) is common, sadly.
It does not matter what the relationship between the sizes of such types is; there will always be values of the operand that cannot be represented in the result.
Saying that the behavior is sometimes undefined is not acceptable. Any implicit conversion of this kind must be an error. Whenever a conversion between signed and unsigned or unsigned and signed is desired, it must be explicit.
This may be the worst mistake that has ever been made in the design of the C language and it has not been corrected even after 50 years.
Making this an error would indeed produce a deluge of error messages in many carelessly written legacy programs, but converting such programs is trivial, and it is extremely likely that many of the cases where compilers currently stay silent can cause bugs in certain corner cases, like in the parent article.
However, this kind of implicit conversion really must be forbidden in the standard, because the correct program source is different from the one the standard currently permits.
When you activate most compiler options that detect undefined behaviors, the correct program source remains the same, even if the compiler now implements a better behavior for the translated program than the minimal behavior specified by the standard.
That happens because most undefined behaviors are detected at run time. Incorrect implicit conversions, on the other hand, are a property of the source code, which can always be detected during compilation, so such programs must be rejected.
But it is easy enough to use modern tooling and coding styles to deal with signed overflow. Nowadays, silent unsigned wrap around causing logic errors is the more vexing issue, which indicates the undefined behavior actually helps rather than hurts when used with good tooling.
The hardware of modern CPUs actually implements 5 distinct data types that must be declared as "unsigned" in C: non-negative integers, integer residues a.k.a. modular integers, bit strings, binary polynomials and binary polynomial residues.
A modern programming language should better have these 5 distinct types, but it must have at least distinct types for non-negative integers and for integer residues. There are several programming languages that provide at least this distinction. The other data types would be more difficult to support in a high-level language, as they use certain machine instructions that compilers typically do not know how to use.
The change in the C standard so that "unsigned" now means integer residue has left the language without any means to specify a data type for non-negative integers, which is extremely wrong, because more programs use "unsigned" for non-negative integers than for integer residues.
The hardware of most CPUs implements non-negative integers very well, so non-negative integer overflow is easily detected, but the current standard makes it impossible to use that hardware support.
I agree though that using "unsigned" for non-negative integers is problematic and that there should be a way to specify non-negative integers. I would be fine with an attribute.
The problem is also that the standard committee is not the ruling body of the C language. It is the place where people come together to negotiate some minimal requirements. If you want something, you need to first convince the compilers vendors to implement it as an extension.
Yes, that's true, but the registers themselves are untyped, what modern CPUs really implement is multiple instruction semantics over the same bit-patterns. In short: same bits, five algebras! The algebras are given by different instructions (on the same bit patterns).
Here is an example using the bit pattern 1011:
• as a non-negative integer: 11. ISA operations: Arm UDIV, RISC-V DIVU, x86 DIV
• as an integer residue mod 16: the class [11] in Z/16Z. ISA operations: Arm ADD, RISC-V ADD/ADDI, x86 ADD
• as a bit string: bits 3, 1, and 0 are set. ISA operations: Arm EOR, RISC-V ANDI/ORI/XORI, x86 AND.
• as a binary polynomial: x^3 + x + 1. ISA operations: Arm PMULL, RISC-V clmul/clmulh/clmulr, x86 PCLMULQDQ
• as a binary polynomial residue modulo, say, x^4 + x + 1: the residue class of x^3 + x + 1 in GF(2)[x] / (x^4 + x + 1). ISA operations: Arm CRC32* / CRC32C*, x86 CRC32, RISC-V clmulr
And actually ... the floating point numbers also have the same bit patterns, and could, in principle, reside in the same registers. On modern ISAs, floats are usually implemented in a distinct register file.
You can use different functions in C on the bit patterns we call unsigned.
If you had a data type with type tags, that still would not mean that the storage location for it is typed, it would only mean that you have implemented a union type.
Typed memory would mean to partition the memory into separate areas for integers, floating-point numbers, strings, etc., which makes no sense because you cannot predict the size of the storage area required for each data type.
In modern CPUs, the registers are typically partitioned by data type into only 3 or 4 sets: first, the so-called general-purpose registers, used for any kind of scalar data type except floating-point numbers; second, a set of scalar floating-point registers; third, a set of vector registers used for any kind of vector data type. In very recent CPUs there may be a fourth set of matrix registers, also used for many data types.
In most current CPUs, e.g. Intel/AMD x86-64 and ARM Aarch64, the scalar floating-point registers are aliased over the vector registers, so these 2 do not form separate register sets.
A finer form of typing for CPU registers is not useful, because it cannot be predicted how many registers of each type will be needed.
Therefore, as you say, the data type of an operation is encoded in the instruction and it is independent of the registers used for operands or results.
Moreover, there are several cases when the same instruction code can be used for multiple data types and the context determines which was the intended data type.
For instance, the same instruction for register addition can be used to add signed integers, non-negative integers and integer residues. The intended data types are distinguished by the following instructions. If the overflow flag is tested, it was an addition of signed integers. If the carry flag is tested, it was an addition of non-negative integers. If the flags are ignored, it was an addition of integer residues.
Another example is the bitwise addition modulo 2 (a.k.a. XOR), which, depending on the context, can be interpreted as addition of bit strings or as addition of binary polynomials.
Yet another example is a left rotation instruction, which can be interpreted either as a rotation of a bit string or as a multiplication by a power of 2 of an integer residue modulo 2^N-1 (this is less known than the fact that a left shift is equivalent to a multiplication by a power of 2 modulo 2^N).
While registers and even instruction encodings can be reused for multiple data types, which leads to significant hardware savings, any program, including the programs written in assembly language, should better define clearly and accurately the exact types of any variables, both to ensure that the program will be easily understood by maintainers and to enable the detection of bugs by program analysis.
The most frequent use of "unsigned" in C programs is for non-negative integers, despite the fact that the current standard specifies that operations with "unsigned" must be implemented as operations with integer residues. This obviously bad feature of the standard has the purpose of allowing lazy programmers to avoid the handling of exceptions, because operations with integer residues cannot generate exceptions. This laziness can frequently lead to bugs that go undetected, or are detected only after they have had serious consequences.
I believe that if one reserves "unsigned" to mean "non-negative integer", then one should use typedefs for different data types whenever "unsigned" is used for another data type, and that includes bit strings, which is probably the next most frequently used data type for which "unsigned" is used.
IBM PL/I, from which the C language has taken many keywords and symbols, including "&" and "|", had distinct types for integers and for bit strings, but C did not also take this feature.
One interesting programming language construct that might be useful in this context are Opaque Type Synonyms, a refined form of C's typedef, which modern languages like Rust, Haskell, Go or Scala offer. This allows the programmer to use the same underlying types (e.g. int), give it different names, and define different algebras with the alias. The typing system prevents the different aliases accidentally to flow into each other. Of course that alone does not help to manage the profusion of algebras over the same bits. I think a better approach for a high-level programming language is to follow assembly and really use different names for different operations, e.g. not have + build in. Instead use explicit names like add_uint32, add_polynomials_gf_2, add_satur_arith, etc etc. The user can then explicitly define (scoped) aliases for them, including +, as long as the typing system can disambiguate the uses. The Sail DSL for ISA specification (https://github.com/rems-project/sail) does this, and it is nice.
With that approach you cannot just write `x` in some situations, but need to use `x._polynomials_gf_2` or whatever the structure's field name is. It is nice to avoid this boilerplate, which can become annoying quickly. Let the type-checker, not the human, do this work.

> You do not need another language for this.
By the Church-Turing thesis you never need another language, but empirical practice has shown that the software engineering properties we see with real-world code and real-world programmers differ significantly between languages.
No, one doesn't need undefined behavior for that at all (which does hurt).
What actually helps is diagnosing the issue, just like one can diagnose the unsigned case just fine (which is not UB).
Instead, for this sort of thing, C could have "Erroneous Behavior", like Rust has (C++ also added it, recently).
Of course, existing ambiguous C code will remain tricky. What matters, after all, is having ways to express what we are expecting in the source code, so that a reader (whether tooling, humans or LLMs) can rely on that.
Does it also complain when the assigned variable is big enough to avoid the problem? Does the compiler generate slower code with the explicit conversions?
It looks like a nice task to compile major projects with -Wsign-conversion and send PRs fixing the warnings. (Assuming there are only a few, let's say 5. Sending an uninvited PR with a thousand changes will make the maintainers unhappy.)
It's not that bad actually; not "always". The only nontrivial case is when, as a part of the usual arithmetic conversions, you (perhaps unwittingly) convert a signed integer type to an unsigned integer type [*], and the original value was negative.
[*] This can happen in two cases (paraphrasing the standard):
- if the operand that has unsigned integer type has rank greater than or equal to the rank of the signed integer type of the other operand,
- if the operand that has signed integer type has rank greater than or equal to the rank of the unsigned integer type of the other operand, but the signed integer type cannot represent all values of the unsigned integer type.
Examples: (a) "unsigned int" vs. "signed int"; (b) "long signed int" vs. "unsigned int" in a POSIX ILP32 programming environment. Under (a), you get conversion to "unsigned int"; under (b), you get conversion (for both operands) to "long unsigned int".
Section "3.2 Conversions | 3.2.1 Arithmetic operands | 3.2.1.1 Characters, and integers" in the C89 Rationale <https://www.open-std.org/Jtc1/sc22/WG14/www/C89Rationale.pdf> is worth reading. (An updated version of the same section is included in the C99 Rationale <https://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.1...> under 6.3.1.1.)
It deals precisely with the problem highlighted in the blog post. I'll quote just the beginning and the end:
> Since the publication of K&R, a serious divergence has occurred among implementations of C in the evolution of integral promotion rules. Implementations fall into two major camps, which may be characterized as unsigned preserving and value preserving. [...]
> The unsigned preserving rules greatly increase the number of situations where unsigned int confronts signed int to yield a questionably signed result, whereas the value preserving rules minimize such confrontations. Thus, the value preserving rules were considered to be safer for the novice, or unwary, programmer. After much discussion, the Committee decided in favor of value preserving rules, despite the fact that the UNIX C compilers had evolved in the direction of unsigned preserving.
> QUIET CHANGE -- A program that depends upon unsigned preserving arithmetic conversions will behave differently, probably without complaint. This is considered the most serious semantic change made by the Committee to a widespread current practice.
Hmm? Seems to me that unsigned -> larger signed works, although other conversions may not.
But yes, I generally agree that these are terrible conversions to do implicitly, given that the entire point of those types is to control the interpretation of memory at a bits-and-bytes level. Languages where implicit numeric conversions make sense are generally not languages that care so much about integer size, and the entire point of having unsigned types is to bake that range constraint in.
C seems to be one of those languages where people think they know it based on prior and adjacent experience. But it is not a language which can be learned based on experience alone. The language is full of cases where things will go badly wrong in a way which is neither obvious nor immediately evident. The negative side effects of what you did often only become evident long after you "learn" it as something you "can" do.
If you want to write C for anything where any security, safety, or reliability requirement needs to be met, you should commit to this strategy: Do not write any code which you are not absolutely certain you could justify the behaviour of by referencing the standard or (in the case of reliance on a specific definition of implementation defined, unspecified, or even (e.g. -ftrapv) undefined behaviour) the implementation documentation.
If you cannot commit to such a (rightfully mentally arduous) policy, you have no business writing C.
The same can actually be applied to C++ and Bash.
But the advice really applies to almost everything you do related to security, safety and reliability. In other languages you may have a panic in production or a supply chain issue.
Doing this for every line is impossibly tedious (people will quickly tire of it), and detecting where the code is actually non-trivial requires a kind of epistemic humility that doesn't come naturally to most.
Better if we can use languages that don't assume such demands are necessary for the compiler to be able to generate performant code.
So it seems that, in regard to bit shifts, C++ behaves slightly differently than C (it appears to have less UB).
This matches my experience whenever I do unconventional or deep work like the article mentions. The engineers comfortable with this type of work will multiply their worth.
> Since virtualization is hardware assisted these days
I was running Xen with full-hardware virtualization on consumer hardware in... 2006. I mean: some of us here were running hardware virt before some of the commenters were born. Just to put the "these days" into perspective in case some would be thinking it's a new thing.