Utilities to convert between std::string and std::wstring.
Enumerations | |
enum | TextEncoding { encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encSCSU, encUTF7, encUTFEBCDIC, encBOCU1 } |
Functions | |
DSOEXPORT std::wstring | decodeCanonicalString (const std::string &str, int version) |
Converts a std::string with multibyte characters into a std::wstring. | |
DSOEXPORT std::string | encodeCanonicalString (const std::wstring &wstr, int version) |
Converts a std::wstring into canonical std::string. | |
DSOEXPORT boost::uint32_t | decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e) |
Return the next Unicode character in the UTF-8 encoded string. | |
DSOEXPORT std::string | encodeUnicodeCharacter (boost::uint32_t ucs_character) |
Encodes the given wide character into a canonical string, theoretically up to 6 chars in length. | |
DSOEXPORT std::string | encodeLatin1Character (boost::uint32_t ucsCharacter) |
Encodes the given wide character into an at least 8-bit character. | |
DSOEXPORT char * | stripBOM (char *in, size_t &size, TextEncoding &encoding) |
Interpret (and skip) Byte Order Mark in input stream. | |
DSOEXPORT const char * | textEncodingName (TextEncoding enc) |
Return name of a text encoding. |
Utilities to convert between std::string and std::wstring.
Strings in Gnash are generally stored as std::string. We must, however, handle characters beyond the 128 standard ASCII codes, and these can be encoded in two different ways.
SWF6 and later use UTF-8, which encodes such characters as multibyte sequences and allows many thousands of unique codes. Multibyte characters are difficult to handle because the length of the string in characters, which many string operations rely on, cannot be known without parsing the string. Converting the string to a std::wstring facilitates string operations, as the length of the wstring equals the number of valid characters. (A wchar_t is generally 32 bits wide, and two bytes is its minimum size; the proprietary player (pp) seems to handle only characters up to 65535.)
SWF5 and earlier, however, used the ISO-8859 specification, which allows the standard 128 ASCII characters plus 128 extra characters that depend on the particular part of ISO-8859 in use. Characters are 8 bits wide, not the 7 bits of standard ASCII. SWF5 cannot handle multibyte characters without special functions.
It is important that SWF5 can distinguish between the two encodings, so we cannot convert all strings to UTF-8.
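To illustrate why the byte length and the character count of a UTF-8 string differ, here is a minimal sketch. The helper countCodePoints and the function demoLength are hypothetical names introduced for this example; they are not part of Gnash.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Gnash function): counts UTF-8 code points by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
// It assumes the input is well-formed UTF-8 and performs no validation.
std::size_t countCodePoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::const_iterator i = s.begin(); i != s.end(); ++i) {
        const unsigned char c = static_cast<unsigned char>(*i);
        if ((c & 0xC0) != 0x80) ++n;   // count lead bytes and ASCII only
    }
    return n;
}

// Usage sketch: 'n' followed by U+00E9 is 2 characters but 3 bytes.
void demoLength()
{
    const std::string text = "n\xC3\xA9";
    assert(text.size() == 3);            // bytes
    assert(countCodePoints(text) == 2);  // characters
}
```

With an 8-bit encoding such as ISO-8859-1 the two numbers always agree, which is why the SWF5 path can work on bytes directly.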
Presently, this code is used for the AS String object, gnash::edit_text_character, ord() and chr().
enum utf8::TextEncoding
std::wstring utf8::decodeCanonicalString(const std::string& str, int version)
Converts a std::string with multibyte characters into a std::wstring.
str | the canonical string to convert. | |
version | the SWF version, used to decide how to decode the string. For SWF5, UTF-8 (or any other) multibyte encoded characters are converted char by char, mangling the string. |
References decodeNextUnicodeCharacter(), gnash::key::e, and INVALID_CHAR.
Referenced by gnash::TextField::replaceSelection(), gnash::TextField::TextField(), gnash::TextField::updateHtmlText(), and gnash::TextField::updateText().
boost::uint32_t utf8::decodeNextUnicodeCharacter(std::string::const_iterator& it, const std::string::const_iterator& e)
Return the next Unicode character in the UTF-8 encoded string.
Invalid UTF-8 sequences produce a U+FFFD character as output. Advances the string iterator past the character returned, unless the returned character is '\0', in which case the iterator does not advance.
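The iterator contract described above can be sketched as follows. This is a hypothetical stand-in written for illustration, not Gnash's implementation: the name sketchDecodeNext is invented, it handles only 1- to 4-byte sequences, it does not reject surrogates, and returning 0 at the end of input is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of a "decode next character" routine.
// Returns U+FFFD for malformed input; returns 0 without advancing when the
// iterator is at the end (an assumption) or points at a NUL byte.
std::uint32_t sketchDecodeNext(std::string::const_iterator& it,
                               const std::string::const_iterator& end)
{
    const std::uint32_t kInvalid = 0xFFFD;
    if (it == end || *it == '\0') return 0;

    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;                 // plain ASCII

    int extra;               // number of continuation bytes expected
    std::uint32_t cp;        // code point being assembled
    std::uint32_t minimum;   // smallest value allowed for this length
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minimum = 0x10000; }
    else return kInvalid;    // stray continuation byte or unsupported lead

    for (; extra > 0; --extra) {
        if (it == end) return kInvalid;           // truncated sequence
        const unsigned char next = static_cast<unsigned char>(*it);
        if ((next & 0xC0) != 0x80) return kInvalid;  // 'it' stays on this byte
        cp = (cp << 6) | (next & 0x3F);
        ++it;
    }
    return cp < minimum ? kInvalid : cp;          // reject overlong forms
}

// Usage sketch: decode 'A' followed by U+20AC (euro sign).
void demoDecode()
{
    const std::string s = "A\xE2\x82\xAC";
    std::string::const_iterator it = s.begin();
    assert(sketchDecodeNext(it, s.end()) == 0x41);
    assert(sketchDecodeNext(it, s.end()) == 0x20AC);
    assert(it == s.end());
}
```

A caller would typically loop until the function returns 0, appending each result to a std::wstring.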
References FIRST_BYTE, and NEXT_BYTE.
Referenced by decodeCanonicalString().
std::string utf8::encodeCanonicalString(const std::wstring& wstr, int version)
Converts a std::wstring into canonical std::string.
wstr | the wide string to convert. | |
version | the SWF version, used to decide how to encode the string. |
For SWF 5, each character is stored as an 8-bit (at least) char, rather than converting it to a canonical UTF-8 byte sequence. Gnash can then distinguish between 8-bit characters, which it handles correctly, and multi-byte characters, which are regarded as multiple characters for string methods.
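The version-dependent behaviour can be illustrated with a small sketch. The function sketchEncode is a hypothetical name, not Gnash's code; for brevity its UTF-8 branch only covers code points up to U+FFFF, and truncating values above 0xFF to their low byte in the SWF5 branch is an assumption of this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: encode a wide string depending on the SWF version.
std::string sketchEncode(const std::wstring& wstr, int version)
{
    std::string out;
    for (std::wstring::const_iterator i = wstr.begin(); i != wstr.end(); ++i) {
        const unsigned long c = static_cast<unsigned long>(*i);
        if (version <= 5 || c < 0x80) {
            // SWF5 and earlier (and ASCII in any version): one char each.
            // Keeping only the low byte is an assumption of this sketch.
            out.push_back(static_cast<char>(c & 0xFF));
        } else if (c < 0x800) {
            // Two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            // Three-byte UTF-8 sequence (sketch limit: up to U+FFFF).
            out.push_back(static_cast<char>(0xE0 | ((c >> 12) & 0x0F)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Usage sketch: U+00E9 is one byte for SWF5 and two bytes for SWF6.
void demoEncode()
{
    const std::wstring w(1, static_cast<wchar_t>(0xE9));
    assert(sketchEncode(w, 5).size() == 1);
    assert(sketchEncode(w, 6) == "\xC3\xA9");
}
```

Because the SWF5 result holds exactly one char per character, byte-oriented string methods give the expected answers for it.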
References encodeLatin1Character(), and encodeUnicodeCharacter().
Referenced by gnash::TextField::get_htmltext_value(), gnash::TextField::get_text_value(), gnash::TextField::setHtmlTextValue(), and gnash::TextField::setTextValue().
std::string utf8::encodeLatin1Character(boost::uint32_t ucsCharacter)
Encodes the given wide character into an at least 8-bit character.
Allows storage of Latin1 (ISO-8859-1) characters. This is the format of SWF5 and below.
Referenced by encodeCanonicalString().
std::string utf8::encodeUnicodeCharacter(boost::uint32_t ucs_character)
Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
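The figure of six chars comes from the original UTF-8 definition, which allowed sequences of up to six bytes for 31-bit values; current UTF-8 is limited to four bytes (up to U+10FFFF). The following sketch shows the general bit layout. The name sketchEncodeCodePoint is hypothetical, and returning an empty string for values that need more than 31 bits is an assumption of this example, not documented Gnash behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch: encode one code point using the original UTF-8
// layout, which permits up to six bytes per character.
std::string sketchEncodeCodePoint(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // ASCII: a single byte
        out.push_back(static_cast<char>(cp));
        return out;
    }

    int extra;                 // number of continuation bytes
    std::uint32_t leadMark;    // high bits that mark the lead byte
    if (cp < 0x800)            { extra = 1; leadMark = 0xC0; }
    else if (cp < 0x10000)     { extra = 2; leadMark = 0xE0; }
    else if (cp < 0x200000)    { extra = 3; leadMark = 0xF0; }
    else if (cp < 0x4000000)   { extra = 4; leadMark = 0xF8; }
    else if (cp < 0x80000000u) { extra = 5; leadMark = 0xFC; }
    else return out;           // not representable: empty (sketch assumption)

    // Lead byte carries the top bits; each continuation byte carries 6 bits.
    out.push_back(static_cast<char>(leadMark | (cp >> (6 * extra))));
    for (int i = extra - 1; i >= 0; --i) {
        out.push_back(static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F)));
    }
    return out;
}

// Usage sketch: U+20AC (euro sign) takes three bytes.
void demoEncodeCodePoint()
{
    assert(sketchEncodeCodePoint(0x41) == "A");
    assert(sketchEncodeCodePoint(0x20AC) == "\xE2\x82\xAC");
    assert(sketchEncodeCodePoint(0x7FFFFFFF).size() == 6);
}
```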
Referenced by encodeCanonicalString().
char* utf8::stripBOM(char* in, size_t& size, TextEncoding& encoding)
Interpret (and skip) Byte Order Mark in input stream.
This function takes a pointer to a buffer and returns a pointer to the start of the actual data, after the BOM if one is present. No conversion is performed and no bytes are copied; the BOM is simply skipped, and its interpretation is returned through the encoding output parameter.
See http://en.wikipedia.org/wiki/Byte-order_mark
in | The input buffer. | |
size | Size of the input buffer, will be decremented by the size of the BOM, if any. | |
encoding | Output parameter, will always be set. encUNSPECIFIED if no BOM is found. |
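A BOM-skipping routine with this shape might look like the sketch below. Everything here is hypothetical: the enum SketchEncoding and the function sketchStripBOM are invented for the example, only five common BOMs are recognised, and the order in which candidates are tested (UTF-32LE before UTF-16LE, since the former begins with the latter's bytes) is a choice of this sketch rather than documented Gnash behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical subset of encodings, for illustration only.
enum SketchEncoding {
    skUnspecified, skUTF8, skUTF16BE, skUTF16LE, skUTF32BE, skUTF32LE
};

// One BOM signature: its bytes, its length and the encoding it marks.
struct SketchBom {
    const char* bytes;
    std::size_t len;
    SketchEncoding enc;
};

// Hypothetical sketch: skip a leading BOM without copying any data.
// 'size' is reduced by the BOM length; 'encoding' is always set.
char* sketchStripBOM(char* in, std::size_t& size, SketchEncoding& encoding)
{
    static const SketchBom boms[] = {
        { "\xEF\xBB\xBF",     3, skUTF8    },
        { "\xFF\xFE\x00\x00", 4, skUTF32LE },  // must precede UTF-16LE
        { "\x00\x00\xFE\xFF", 4, skUTF32BE },
        { "\xFE\xFF",         2, skUTF16BE },
        { "\xFF\xFE",         2, skUTF16LE },
    };

    encoding = skUnspecified;
    for (std::size_t i = 0; i < sizeof(boms) / sizeof(boms[0]); ++i) {
        if (size >= boms[i].len &&
            std::memcmp(in, boms[i].bytes, boms[i].len) == 0) {
            encoding = boms[i].enc;
            size -= boms[i].len;
            return in + boms[i].len;       // first byte after the BOM
        }
    }
    return in;                             // no BOM: buffer is unchanged
}

// Usage sketch: a UTF-8 BOM followed by the two bytes "hi".
void demoStripBOM()
{
    char buf[] = "\xEF\xBB\xBF" "hi";
    std::size_t size = sizeof(buf) - 1;    // exclude the terminating NUL
    SketchEncoding enc;
    char* data = sketchStripBOM(buf, size, enc);
    assert(data == buf + 3);
    assert(size == 2);
    assert(enc == skUTF8);
}
```

Since the returned pointer aliases the input buffer, the caller must keep that buffer alive for as long as the returned data is used.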
const char* utf8::textEncodingName(TextEncoding enc)
Return name of a text encoding.
References encBOCU1, encSCSU, encUNSPECIFIED, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encUTF7, encUTF8, and encUTFEBCDIC.