Scatter/Gather thoughts

by Johan Petersson

Character types in C and C++

How many built-in character types are there in C++? The answer may surprise you.

The language described in the original 1978 edition of The C Programming Language (aka "the White Book") by Kernighan and Ritchie had no signed keyword, so there were only two character types: char and unsigned char.

This was analogous to the situation for int (which is always signed) and unsigned int, except that C compilers were allowed to make char an unsigned type, which many did – typically due to platform conventions or better optimization opportunities for unsigned integer arithmetic. Granting compilers such latitude has been an important factor in making C a highly portable, yet efficient, language.

Leaving the signedness of char up to the implementation had its drawbacks, though. On a platform where the plain character type was unsigned, you'd have one less built-in type for small integers; the smallest signed type was short, which could very well be larger than a char. As might be expected, lots of programs were written that relied on char being signed, and lots of others that relied on it being unsigned.

In the ANSI C Draft Standard, the keyword signed was added, introducing a signed char type for all platforms. The new keyword solved the problem of not being able to use signed char portably, but by this point the standards committee could no longer mandate that plain char be signed. Doing so would have broken a lot of code and upset vendors as well as users.

The compromise was to make signed char a type distinct from the two existing character types, while requiring char to have the same representation and values as either signed char or unsigned char. In other words, a char must look exactly like a signed char or unsigned char to the hardware; which one is implementation-defined. C++ later adopted this compromise for compatibility with C, so both languages now have three distinct char types.
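The compromise can be checked mechanically. What follows is only a minimal sketch, assuming a C++11 or later compiler (for static_assert and <type_traits>, neither of which existed when the rule was written); the helper names plain_char_behaves_like and plain_char_range_matches are made up for illustration.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// The three char types are pairwise distinct on every platform.
static_assert(!std::is_same<char, signed char>::value,
              "char and signed char are distinct types");
static_assert(!std::is_same<char, unsigned char>::value,
              "char and unsigned char are distinct types");
static_assert(!std::is_same<signed char, unsigned char>::value,
              "signed char and unsigned char are distinct types");

// Hypothetical helper: names the explicit type that plain char mirrors here.
inline const char* plain_char_behaves_like()
{
    return std::numeric_limits<char>::is_signed ? "signed char"
                                                : "unsigned char";
}

// Hypothetical helper: true if plain char has the same range of values as
// the explicit type it mirrors on this implementation.
inline bool plain_char_range_matches()
{
    typedef std::numeric_limits<char> C;
    if (C::is_signed) {
        return C::min() == std::numeric_limits<signed char>::min()
            && C::max() == std::numeric_limits<signed char>::max();
    }
    return C::min() == std::numeric_limits<unsigned char>::min()
        && C::max() == std::numeric_limits<unsigned char>::max();
}
```

Calling plain_char_behaves_like() tells you which way your compiler went; the static assertions hold regardless of the answer.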

You have probably seen the wide character type wchar_t even if you haven't used it (there are certain caveats to wchar_t, but that's a topic for another time). The _t suffix is a common convention indicating a typedef name, and that's the way the C standard defines wchar_t (in the header stddef.h). Since typedef doesn't create types, only new names for other types, wchar_t is not a distinct type in C.

In contrast, the C++ standard defines wchar_t as a built-in type with "the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type" (C++98 §3.9.1). This makes wchar_t a distinct type with the same representation as another type, in a way quite similar to char and subtly different from the wide character type in C.
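The same kind of check works for wchar_t. This is a rough sketch assuming a conforming C++11 or later compiler (some compilers offer a non-conforming mode that turns wchar_t into a typedef, in which case the static assertions would fire). The helpers could_be_underlying and has_underlying_candidate are hypothetical names, and the list of candidate types is merely a plausible guess at what an implementation might pick.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// In C++, wchar_t is a type of its own, not another name for an integer type.
static_assert(!std::is_same<wchar_t, short>::value, "wchar_t is not short");
static_assert(!std::is_same<wchar_t, unsigned short>::value, "wchar_t is not unsigned short");
static_assert(!std::is_same<wchar_t, int>::value, "wchar_t is not int");
static_assert(!std::is_same<wchar_t, unsigned int>::value, "wchar_t is not unsigned int");
static_assert(!std::is_same<wchar_t, long>::value, "wchar_t is not long");
static_assert(!std::is_same<wchar_t, unsigned long>::value, "wchar_t is not unsigned long");

// Hypothetical helper: true if T matches wchar_t in size, alignment and
// signedness, i.e. T could be the underlying type on this implementation.
template <typename T>
constexpr bool could_be_underlying()
{
    return sizeof(T) == sizeof(wchar_t)
        && alignof(T) == alignof(wchar_t)
        && std::numeric_limits<T>::is_signed == std::numeric_limits<wchar_t>::is_signed;
}

// Hypothetical helper: true if at least one common integer type qualifies.
constexpr bool has_underlying_candidate()
{
    return could_be_underlying<signed char>()
        || could_be_underlying<unsigned char>()
        || could_be_underlying<short>()
        || could_be_underlying<unsigned short>()
        || could_be_underlying<int>()
        || could_be_underlying<unsigned int>()
        || could_be_underlying<long>()
        || could_be_underlying<unsigned long>();
}
```

Which candidate matches differs between platforms, so the sketch deliberately avoids hard-coding one.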

As with the char type in C and C++, it is implementation-defined whether wchar_t is a signed or unsigned type. Does this mean there are three distinct types of wide characters as well? No, signedness can't be forced by writing unsigned wchar_t or signed wchar_t; there are no such types and compilers should flag the code as erroneous.

There were no legacy reasons for introducing signed and unsigned variants of the wide character type, and it doesn't make sense to use wchar_t for storing integers anyway, since it has the same representation as one of the built-in integral types and that type can be used directly. There are thus four distinct character types in Standard C++: char, signed char, unsigned char, and wchar_t.
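One way to convince yourself that char, signed char, unsigned char and wchar_t really are four separate types is to overload a function on each of them. If any two were the same type, the compiler would reject the overload set as a redefinition. The function name kind is hypothetical, chosen only for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical overload set: one function per character type.
// This compiles only because all four parameter types are distinct.
inline std::string kind(char)          { return "char"; }
inline std::string kind(signed char)   { return "signed char"; }
inline std::string kind(unsigned char) { return "unsigned char"; }
inline std::string kind(wchar_t)       { return "wchar_t"; }
```

An ordinary character literal such as 'a' selects the char overload and L'a' selects the wchar_t one; the signed char and unsigned char overloads are only reached through a variable or cast of exactly that type.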

Is this useful information or merely pedantic trivia? Knowing the distinct character types is important when you overload functions and specialize templates in C++, but even in C it can be relevant due to the way conversions work:

int main(void)
{
             char *a = "Hello, World!";
    unsigned char *b = a; /* distinct types! */
    signed   char *c = a; /* distinct types! */

    return 0;
}

C compilers are supposed to warn about the above code, but in practice many do not. gcc will inform you that the pointer targets differ in signedness if you use the -pedantic flag, but by default it silently accepts such conversions. g++ correctly rejects the same program:

error: invalid conversion from `char*' to `unsigned char*'
error: invalid conversion from `char*' to `signed char*'

Casts should be used in both languages when converting between pointers to the different char types. In C++ you can't get away with being sloppy: without the cast, the program does not compile.
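In C++ the cast in question would normally be a reinterpret_cast. The sketch below wraps it in two small helpers, as_unsigned and as_signed, which are hypothetical names invented for this example; const is used throughout because the pointers refer to a string literal.

```cpp
#include <cassert>

// Hypothetical helper: view a char buffer through an unsigned char pointer.
inline const unsigned char* as_unsigned(const char* p)
{
    return reinterpret_cast<const unsigned char*>(p);
}

// Hypothetical helper: the same conversion for signed char.
inline const signed char* as_signed(const char* p)
{
    return reinterpret_cast<const signed char*>(p);
}
```

With these in place, the earlier program becomes const char *a = "Hello, World!"; followed by const unsigned char *b = as_unsigned(a); and const signed char *c = as_signed(a);, and both pointers refer to the same bytes as a.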

26 January, 2005