Character sets and encodings
Basic character set
The basic character set consists of the following 95 characters:
Code unit | Character | Glyph |
---|---|---|
U+0009 | Character tabulation | |
U+000B | Line tabulation | |
U+000C | Form feed (FF) | |
U+0020 | Space | |
U+0021 | Exclamation mark | ! |
U+0022 | Quotation mark | " |
U+0023 | Number sign | # |
U+0025 | Percent sign | % |
U+0026 | Ampersand | & |
U+0027 | Apostrophe | ' |
U+0028 | Left parenthesis | ( |
U+0029 | Right parenthesis | ) |
U+002A | Asterisk | * |
U+002B | Plus sign | + |
U+002C | Comma | , |
U+002D | Hyphen-minus | - |
U+002E | Full stop | . |
U+002F | Solidus | / |
U+0030 .. U+0039 | Digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
U+003A | Colon | : |
U+003B | Semicolon | ; |
U+003C | Less-than sign | < |
U+003D | Equals sign | = |
U+003E | Greater-than sign | > |
U+003F | Question mark | ? |
U+0041 .. U+005A | Latin capital letter A .. Z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
U+005B | Left square bracket | [ |
U+005C | Reverse solidus | \ |
U+005D | Right square bracket | ] |
U+005E | Circumflex accent | ^ |
U+005F | Low line | _ |
U+0061 .. U+007A | Latin small letter a .. z | a b c d e f g h i j k l m n o p q r s t u v w x y z |
U+007B | Left curly bracket | { |
U+007C | Vertical line | \| |
U+007D | Right curly bracket | } |
U+007E | Tilde | ~ |
Unlike in C++, the U+000A LINE FEED (LF) character is not included in the basic character set. Instead, there shall be some way of indicating the end of each line of text in the source file, and the C standard treats such an end-of-line indicator as if it were a single new-line character.
The basic character set is also known as the basic source character set.
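As an illustration of the table above, the following sketch tests whether a character belongs to the basic character set. The helper name `is_basic_character` is hypothetical and not part of any standard header; the membership string is written with character escapes so that the check does not assume ASCII code values.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not a standard function): reports whether c is one of
   the 95 members of the basic character set. */
static bool is_basic_character(int c)
{
    static const char members[] =
        "\t\v\f "                          /* tab, line tab, form feed, space */
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"  /* 29 graphic characters */
        "0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz";
    static_assert(sizeof members - 1 == 95, "expected 95 members");

    /* Reject 0 explicitly: strchr would otherwise match the terminator. */
    if (c <= 0 || c > CHAR_MAX)
        return false;
    return strchr(members, c) != NULL;
}
```

With this sketch, `is_basic_character('a')` is true, while `is_basic_character('\n')` is false because the new-line character is not a member of the basic character set.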
Basic execution character set
The basic execution character set contains all the members of the basic character set, plus the following characters:
Code unit | Character |
---|---|
U+0000 | Null |
U+0007 | Bell |
U+0008 | Backspace |
U+000A | Line feed (LF) |
U+000D | Carriage return (CR) |
The values of the members of the basic execution character set shall be non-negative and distinct from one another. In both the basic source and basic execution character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous one. The U+0000 NULL character has the value 0.
The representation of each member of the basic execution character set fits in a byte.
In C++, the basic execution character set is also known as the basic literal character set and the basic execution wide-character set.
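The guarantee about decimal digits is what makes the common `c - '0'` idiom portable. A minimal sketch follows, in which the helper name `digit_value` is an assumption made for illustration:

```c
#include <assert.h>

/* The ten decimal digits have consecutive values in every conforming
   implementation, so these checks cannot fail. */
static_assert('9' - '0' == 9, "decimal digits are contiguous");
static_assert('\0' == 0, "the null character has the value 0");

/* Hypothetical helper: converts a decimal digit character to its numeric
   value, or returns -1 for any other character. */
static int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

No comparable guarantee exists for letters: in EBCDIC-based encodings the values of `'a'` through `'z'` are not contiguous, so the same trick is not portable for alphabetic characters.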
Literal encodings
The literal encoding is an implementation-defined mapping of the characters of the execution character set to the values in a character constant or string literal without encoding prefix. It supports a mapping from all the basic execution character set values into the implementation-defined encoding. It may contain multibyte character sequences.
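The following characters are not in the basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.
Code unit | Character | Glyph |
---|---|---|
U+0024 | Dollar sign | $ |
U+0040 | Commercial at | @ |
U+0060 | Grave accent | ` |
(since C23)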
The wide literal encoding is an implementation-defined mapping of the characters of the execution character set to the values in an L-prefixed character constant or string literal. It supports a mapping from all the basic execution character set values into the implementation-defined encoding. If an implementation does not define __STDC_MB_MIGHT_NEQ_WC__, the mapping produces values identical to the literal encoding for all the basic execution character set values. One or more values may map to one or more values of the extended execution character set.
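The relationship between the literal encoding and the wide literal encoding can be checked directly. The sketch below compares a sample of basic characters in both forms; the function name `wide_matches_narrow` and the choice of sample characters are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

#ifndef __STDC_MB_MIGHT_NEQ_WC__
/* Without the macro, basic characters have the same value in both encodings. */
static_assert(L'A' == 'A', "wide and narrow values of 'A' agree");
static_assert(L'0' == '0', "wide and narrow values of '0' agree");
#endif

/* Hypothetical helper: compares a sample of basic characters element by
   element, including the terminating null character. */
static bool wide_matches_narrow(void)
{
    static const char    narrow[] =  "0123456789AZaz_+-*/";
    static const wchar_t wide[]   = L"0123456789AZaz_+-*/";

    for (size_t i = 0; i < sizeof narrow; ++i) {
        if ((wchar_t)narrow[i] != wide[i])
            return false;
    }
    return true;
}
```

On an implementation that defines `__STDC_MB_MIGHT_NEQ_WC__`, the function may return false, and code that converts basic characters with a plain cast should not rely on the two encodings agreeing.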
The UTF-8 encoding is used for mapping characters of the execution character set to a u8-prefixed character constant (since C23) or string literal.
An implementation-defined encoding (until C23) or the UTF-16 encoding (since C23) is used for mapping characters of the execution character set to a u-prefixed character constant or string literal.
An implementation-defined encoding (until C23) or the UTF-32 encoding (since C23) is used for mapping characters of the execution character set to a U-prefixed character constant or string literal.
(since C11)
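To make the three Unicode encodings concrete, the sketch below stores the same kind of data with each prefix and checks the resulting code units. The array names are hypothetical. The UTF-16 and UTF-32 checks are guarded by `__STDC_UTF_16__` and `__STDC_UTF_32__`, since before C23 those encodings are only guaranteed when the macros are defined. Universal character names are used so that the example does not depend on the encoding of the source file.

```c
#include <assert.h>
#include <uchar.h>

/* U+00E9 (Latin small letter e with acute) takes two UTF-8 code units. */
static const unsigned char utf8_e_acute[] = u8"\u00e9";

/* U+1F600 lies outside the Basic Multilingual Plane. */
static const char16_t utf16_face[] = u"\U0001F600";
static const char32_t utf32_face[] = U"\U0001F600";

/* Two code units plus the terminating null. */
static_assert(sizeof utf8_e_acute == 3, "UTF-8: 0xC3 0xA9 0x00");

#ifdef __STDC_UTF_16__
/* A surrogate pair plus the terminating null. */
static_assert(sizeof utf16_face / sizeof utf16_face[0] == 3,
              "UTF-16: high surrogate, low surrogate, null");
#endif

#ifdef __STDC_UTF_32__
/* One code unit per code point plus the terminating null. */
static_assert(sizeof utf32_face / sizeof utf32_face[0] == 2,
              "UTF-32: one code unit, null");
#endif
```

The `unsigned char` array is used for the UTF-8 literal because the element type of `u8` string literals differs between C11 (`char`) and C23 (`char8_t`); an array of `unsigned char` can be initialized from either.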
See also
ASCII chart
C++ documentation for Character sets and encodings