3.2. Character set

Viper uses the full ISO-10646 character set. By default, characters are assumed to be encoded in UTF-8, a US-ASCII-compatible encoding and the standard representation of ISO-10646 on Linux.

Viper accepts the full range of code points recommended by ISO-10646 in identifiers. The specification is taken from the ISO C++ Standard, Appendix C.

Support for ISO-10646 and UTF-8 is incomplete. A table of characters supported in identifiers should be provided. Details of UCS-16 support should be provided. Details of encoding autodetection should be provided when it is implemented. Details of level 2 and 3 support (combining characters) should be provided.

Generally, Viper simply treats strings as 8-bit clean byte sequences, which is enough to handle correct UTF-8 in most cases. There are several exceptions.
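The following sketch illustrates why 8-bit clean processing is usually sufficient. It uses Python 3 bytes objects as a stand-in for an 8-bit clean string type; this is an assumption made for illustration and does not show Viper's internal representation.

```python
# Illustration only: Python 3 bytes stand in for an 8-bit clean string type.
text = "naïve café".encode("utf-8")    # 'ï' and 'é' each occupy two bytes

# Concatenation and substring search work on whole byte sequences,
# so well-formed UTF-8 passes through them unchanged.
joined = text + b" " + "π".encode("utf-8")
found = joined.find("café".encode("utf-8"))   # a byte offset, not a character index

# Where byte-wise processing shows through: lengths and offsets count bytes.
byte_len = len(text)
char_len = len(text.decode("utf-8"))
```

Operations that copy, join, or compare whole strings are unaffected by the encoding. Operations that depend on the meaning of individual characters are the ones that need extra support.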

Case mapping requires extra support with ISO-10646 and UTF-8. This is not yet implemented. It impacts the string and re modules in particular.
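As a sketch of the problem, the example below contrasts byte-wise and character-aware case mapping. It uses Python 3's bytes.upper and str.upper as stand-ins; how Viper's string module would expose such mapping is not specified here.

```python
# Illustration only: byte-wise versus character-aware upper-casing.
raw = "é".encode("utf-8")             # the two bytes 0xC3 0xA9

# A byte-wise mapping only knows the ASCII letters, so the encoded
# character is left untouched.
byte_upper = raw.upper()

# A character-aware mapping consults the ISO-10646 case tables.
char_upper = "é".upper()              # U+00C9, LATIN CAPITAL LETTER E WITH ACUTE

# Mixed input: only the ASCII part changes under byte-wise mapping.
mixed_upper = "abcé".encode("utf-8").upper()
```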

Lexing of identifiers requires additional regular expressions in the native lexer, which have not yet been provided. The lexer is 8-bit clean, and can easily recognize UTF-8 encoded ISO-10646. The lexer does not use any character-set-specific escapes in its regular expressions.
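One possible shape for such a regular expression is sketched below, using only byte ranges and no character-set-specific escapes. The pattern and the helper match_identifier are hypothetical and are not Viper's lexer rules. The pattern accepts any roughly well-formed multi-byte UTF-8 sequence; it does not check code points against the table of characters permitted in identifiers, and it does not reject every ill-formed sequence.

```python
import re

# Hypothetical pattern: one multi-byte UTF-8 sequence, described by byte ranges only.
UTF8_MULTIBYTE = (
    rb"(?:[\xc2-\xdf][\x80-\xbf]"        # 2-byte sequence
    rb"|[\xe0-\xef][\x80-\xbf]{2}"       # 3-byte sequence
    rb"|[\xf0-\xf4][\x80-\xbf]{3})"      # 4-byte sequence
)

# An identifier starts with an ASCII letter, underscore, or multi-byte
# sequence, and may continue with any of those or an ASCII digit.
IDENT = re.compile(
    rb"(?:[A-Za-z_]|" + UTF8_MULTIBYTE + rb")"
    rb"(?:[A-Za-z0-9_]|" + UTF8_MULTIBYTE + rb")*"
)

def match_identifier(source: bytes) -> bytes:
    """Return the identifier at the start of source, or b'' if there is none."""
    m = IDENT.match(source)
    return m.group(0) if m else b""
```

For example, match_identifier("größe = 1".encode("utf-8")) returns the UTF-8 bytes of "größe", while input beginning with a digit yields an empty result.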

Similarly, recognition of the full set of spacing characters, and processing of leading whitespace, are not yet extended to ISO-10646/UTF-8.
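The gap can be illustrated with a line indented by a non-ASCII spacing character. The sketch assumes Python 3 bytes.lstrip and str.lstrip as stand-ins for byte-wise and character-aware whitespace handling; it does not show Viper's lexer.

```python
# Illustration only: U+2003 EM SPACE used as leading whitespace.
em_space = "\u2003"
line = (em_space + "x = 1").encode("utf-8")

# Byte-wise stripping recognizes only ASCII whitespace, so the
# three-byte encoding of EM SPACE is left in place.
byte_stripped = line.lstrip()

# Character-aware stripping recognizes the ISO-10646 spacing character.
char_stripped = line.decode("utf-8").lstrip().encode("utf-8")
```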

Finally, support for \uXXXX and \UXXXXXXXX escapes in strings is implemented; however, it is active in both Viper and Python modes.
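The two escape forms can be illustrated with Python 3 string literals, which use the same \uXXXX and \UXXXXXXXX notation. This is an illustration of the notation only; whether Viper's handling matches Python 3 in every detail is an assumption.

```python
# Illustration only: the two escape forms, as Python 3 string literals.
short_form = "\u00e9"         # four hex digits: U+00E9
long_form = "\U0001F40D"      # eight hex digits: U+1F40D

# Each escape denotes a single code point.
short_cp = ord(short_form)
long_cp = ord(long_form)

# Encoded as UTF-8, the code points take two and four bytes respectively.
short_utf8 = short_form.encode("utf-8")
long_utf8 = long_form.encode("utf-8")
```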