The ISO C standard defines functions to convert strings from a multibyte representation to wide character strings. There are a number of peculiarities:
The character set assumed for the multibyte encoding is not specified as an argument to the functions. Instead the character set specified by the LC_CTYPE category of the current locale is used; see the section called “Categories of Activities that Locales Affect”.
The functions handling more than one character at a time require NUL-terminated strings as arguments. That is, converting blocks of text does not work unless one can add a NUL byte at an appropriate place. The GNU C library contains some extensions to the standard which allow specifying a size, but basically they also expect terminated strings.
Despite these limitations the ISO C functions can very well be used in many contexts. In graphical user interfaces, for instance, it is not uncommon to have functions which require text to be displayed in a wide character string if it is not simple ASCII. The text itself might come from a file with translations and the user should decide about the current locale which determines the translation and therefore also the external encoding used. In such a situation (and many others) the functions described here are perfect. If more freedom while performing the conversion is necessary take a look at the iconv functions (the section called “Generic Charset Conversion”).
We already said above that the currently selected locale for the LC_CTYPE category determines the conversion which is performed by the functions we are about to describe. Each locale uses its own character set (given as an argument to localedef) and this is the one assumed as the external multibyte encoding. The wide character set is always UCS-4, at least on GNU systems.
A characteristic of each multibyte character set is the maximum number of bytes which can be necessary to represent one character. This information is quite important when writing code which uses the conversion functions, as the examples below show. The ISO C standard defines two macros which provide this information.
int MB_LEN_MAX
This macro specifies the maximum number of bytes in the multibyte sequence for a single character in any of the supported locales. It is a compile-time constant and it is defined in limits.h.
int MB_CUR_MAX
MB_CUR_MAX expands into a positive integer expression that is the maximum number of bytes in a multibyte character in the current locale. The value is never greater than MB_LEN_MAX. Unlike MB_LEN_MAX this macro need not be a compile-time constant and in fact, in the GNU C library it is not.
MB_CUR_MAX is defined in stdlib.h.
Two different macros are necessary since strict ISO C90 compilers do not allow variable length array definitions, but it is still desirable to avoid dynamic allocation. This incomplete piece of code shows the problem:
{
  char buf[MB_LEN_MAX];
  ssize_t len = 0;

  while (! feof (fp))
    {
      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      /* ... process buf */
      len -= used;
    }
}
The code in the inner loop is expected to have always enough bytes in the array buf to convert one multibyte character. The array buf has to be sized statically since many compilers do not allow a variable size. The fread call makes sure that always MB_CUR_MAX bytes are available in buf. Note that it isn't a problem if MB_CUR_MAX is not a compile-time constant.
In the introduction of this chapter it was said that certain character sets use a stateful encoding. I.e., the encoded values depend in some way on the previous bytes in the text.
Since the conversion functions allow converting a text in more than one step we must have a way to pass this information from one call of the functions to another.
mbstate_t
A variable of type mbstate_t can contain all the information about the shift state needed from one call to a conversion function to another.
This type is defined in wchar.h. It was introduced in Amendment 1 to ISO C90.
To use objects of this type the programmer has to define such objects (normally as local variables on the stack) and pass a pointer to the object to the conversion functions. This way the conversion function can update the object if the current multibyte character set is stateful.
There is no specific function or initializer to put the state object in any specific state. The rules are that the object should always represent the initial state before the first use and this is achieved by clearing the whole variable with code such as follows:
{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* from now on state can be used.  */
  ...
}
When using the conversion functions to generate output it is often necessary to test whether the current state corresponds to the initial state. This is necessary, for example, to decide whether or not to emit escape sequences to set the state to the initial state at certain sequence points. Communication protocols often require this.
int mbsinit (const mbstate_t *ps)
This function determines whether the state object pointed to by ps is in the initial state or not. If ps is a null pointer or the object is in the initial state the return value is nonzero. Otherwise it is zero.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
Code using this function often looks similar to this:
{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  /* Use state.  */
  ...
  if (! mbsinit (&state))
    {
      /* Emit code to return to initial state.  */
      const wchar_t empty[] = L"";
      const wchar_t *srcp = empty;
      wcsrtombs (outbuf, &srcp, outbuflen, &state);
    }
  ...
}
The code to emit the escape sequence to get back to the initial state is interesting. The wcsrtombs function can be used to determine the necessary output code (the section called “Converting Multibyte and Wide Character Strings”). Please note that on GNU systems it is not necessary to perform this extra action for the conversion from multibyte text to wide character text since the wide character encoding is not stateful. But there is nothing mentioned in any standard which prohibits wchar_t from using a stateful encoding.
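For a single output stream the same reset can be sketched with wcrtomb instead of wcsrtombs. The helper name emit_reset_sequence and its return convention are assumptions made for this illustration only; the sketch relies on the documented behavior that converting the NUL wide character emits any necessary shift sequence followed by one NUL byte.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: if *ps is not in the initial state, write the
   shift sequence that returns to the initial state into out and
   return its length, not counting the NUL byte that wcrtomb appends.
   out must have room for at least MB_LEN_MAX bytes.  Returns 0 if
   nothing had to be emitted and (size_t) -1 on a conversion error.  */
size_t
emit_reset_sequence (char *out, mbstate_t *ps)
{
  size_t n;

  if (mbsinit (ps))
    return 0;

  /* Converting L'\0' stores the shift sequence plus one NUL byte.  */
  n = wcrtomb (out, L'\0', ps);
  if (n == (size_t) -1)
    return n;
  return n - 1;
}
```

With a stateless encoding the function always returns 0, which is the common case on GNU systems.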
The most fundamental of the conversion functions are those dealing with single characters. Please note that this does not always mean single bytes. But since there is very often a subset of the multibyte character set which consists of single byte sequences there are functions to help with converting bytes. One very important and often applicable scenario is where ASCII is a subset of the multibyte character set. That is, all ASCII characters stand for themselves and all other characters have at least a first byte which is beyond the range 0 to 127.
wint_t btowc (int c)
The btowc function ("byte to wide character") converts a valid single byte character c in the initial shift state into the wide character equivalent using the conversion rules from the currently selected locale of the LC_CTYPE category.
If (unsigned char) c is not a valid single byte multibyte character or if c is EOF the function returns WEOF.
Please note the restriction of c being tested for validity only in the initial shift state. There is no mbstate_t object used from which the state information is taken and the function also does not use any static state.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
Despite the limitation that the single byte value is always interpreted in the initial state this function is actually useful most of the time. Most character sets are either entirely single-byte character sets or they are extensions to ASCII. It is then possible to write code like this (not that this specific example is very useful):
wchar_t *
itow (unsigned long int val)
{
  static wchar_t buf[30];
  wchar_t *wcp = &buf[29];
  *wcp = L'\0';
  while (val != 0)
    {
      *--wcp = btowc ('0' + val % 10);
      val /= 10;
    }
  if (wcp == &buf[29])
    *--wcp = L'0';
  return wcp;
}
Why is it necessary to use such a complicated implementation and not simply cast '0' + val % 10 to a wide character? The answer is that there is no guarantee that one can perform this kind of arithmetic on the characters of the character set used for the wchar_t representation. In other situations the bytes are not constant at compile time and so the compiler cannot do the work. In situations like this it is necessary to use btowc.
There also is a function for the conversion in the other direction.
int wctob (wint_t c)
The wctob function ("wide character to byte") takes as the parameter a valid wide character. If the multibyte representation for this character in the initial state is exactly one byte long the return value of this function is this character. Otherwise the return value is EOF.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
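The two functions can be combined into a quick check of whether a byte value is a complete single-byte character in the current locale. The following is only a sketch; the name single_byte_roundtrips is made up for this illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: return 1 if the byte value c (0 to 255) is a
   complete single-byte character in the current locale and survives
   the round trip byte -> wide character -> byte, and 0 otherwise.  */
int
single_byte_roundtrips (int c)
{
  wint_t wc = btowc (c);

  if (wc == WEOF)
    return 0;
  return wctob (wc) == c;
}
```

In a locale whose character set extends ASCII, the function is expected to return 1 for every ASCII byte and 0 for lead bytes of longer sequences.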
There are more general functions to convert single characters from multibyte representation to wide characters and vice versa. These functions pose no limit on the length of the multibyte representation and they also do not require it to be in the initial state.
size_t mbrtowc (wchar_t *restrict pwc, const char *restrict s, size_t n, mbstate_t *restrict ps)
The mbrtowc function ("multibyte restartable to wide character") converts the next multibyte character in the string pointed to by s into a wide character and stores it in the wide character string pointed to by pwc. The conversion is performed according to the locale currently selected for the LC_CTYPE category. If the conversion for the character set used in the locale requires a state the multibyte string is interpreted in the state represented by the object pointed to by ps. If ps is a null pointer, a static, internal state variable used only by the mbrtowc function is used.
If the next multibyte character corresponds to the NUL wide character the return value of the function is 0 and the state object is afterwards in the initial state. If the next n or fewer bytes form a correct multibyte character the return value is the number of bytes starting from s which form the multibyte character. The conversion state is updated according to the bytes consumed in the conversion. In both cases the wide character (either the L'\0' or the one found in the conversion) is stored in the string pointed to by pwc if and only if pwc is not null.
If the first n bytes of the multibyte string possibly form a valid multibyte character but there are more than n bytes needed to complete it the return value of the function is (size_t) -2 and no value is stored. Please note that this can happen even if n has a value greater than or equal to MB_CUR_MAX since the input might contain redundant shift sequences.
If the first n bytes of the multibyte string cannot possibly form a valid multibyte character also no value is stored, the global variable errno is set to the value EILSEQ and the function returns (size_t) -1. The conversion state is afterwards undefined.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
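The four kinds of return value described above can be told apart as in the following sketch, which counts the characters in a byte buffer. The function name count_characters and its negative error codes are assumptions chosen for this illustration.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in the first n
   bytes of s.  Returns the count, -1 if an invalid sequence is found,
   or -2 if the buffer ends in the middle of a character.  */
long
count_characters (const char *s, size_t n)
{
  mbstate_t state;
  long count = 0;

  memset (&state, '\0', sizeof (state));
  while (n > 0)
    {
      wchar_t wc;
      size_t nbytes = mbrtowc (&wc, s, n, &state);

      if (nbytes == (size_t) -1)
        return -1;              /* Invalid; errno is EILSEQ.  */
      if (nbytes == (size_t) -2)
        return -2;              /* Incomplete character at the end.  */
      if (nbytes == 0)
        nbytes = 1;             /* A NUL byte counts as one character.  */
      s += nbytes;
      n -= nbytes;
      ++count;
    }
  return count;
}
```

Because the length is passed explicitly, the buffer may contain embedded NUL bytes and need not be terminated.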
Using this function is straightforward. A function which copies a multibyte string into a wide character string while at the same time converting all lowercase characters into uppercase could look like this (this is not the final version, just an example; it has no error checking, and sometimes leaks memory):
wchar_t *
mbstouwcs (const char *s)
{
  /* Count the terminating NUL byte so that mbrtowc can see it.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp[1];
  mbstate_t state;
  size_t nbytes;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        /* Invalid input string.  (result is leaked here.)  */
        return NULL;
      *wcp++ = towupper (tmp[0]);
      len -= nbytes;
      s += nbytes;
    }
  /* mbrtowc returned 0: the terminating NUL was converted.  */
  *wcp = L'\0';
  return result;
}
The use of mbrtowc should be clear. A single wide character is stored in tmp[0] and the number of consumed bytes is stored in the variable nbytes. If the conversion was successful the uppercase variant of the wide character is stored in the result array and the pointer to the input string and the number of available bytes are adjusted.
The only non-obvious thing about the function might be the way memory is allocated for the result. The above code uses the fact that there can never be more wide characters in the converted result than there are bytes in the multibyte input string. This method yields a pessimistic guess about the size of the result, and if many wide character strings have to be constructed this way or the strings are long, the extra memory allocated because the input string contains multibyte characters might be significant. It would be possible to resize the allocated memory block to the correct size before returning it. A better solution might be to allocate just the right amount of space for the result right away. Unfortunately there is no function to compute the length of the wide character string directly from the multibyte string. But there is a function which does part of the work.
size_t mbrlen (const char *restrict s, size_t n, mbstate_t *ps)
The mbrlen function ("multibyte restartable length") computes the number of at most n bytes starting at s which form the next valid and complete multibyte character.
If the next multibyte character corresponds to the NUL wide character the return value is 0. If the next n bytes form a valid multibyte character the number of bytes belonging to this multibyte character byte sequence is returned.
If the first n bytes possibly form a valid multibyte character but it is incomplete the return value is (size_t) -2. Otherwise the multibyte character sequence is invalid and the return value is (size_t) -1.
The multibyte sequence is interpreted in the state represented by the object pointed to by ps. If ps is a null pointer, a state object local to mbrlen is used.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
The attentive reader now will of course note that mbrlen can be implemented as
mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
This is true and in fact is mentioned in the official specification. Now, how can this function be used to determine the length of the wide character string created from a multibyte character string? It is not directly usable but we can define a function mbslen using it:
size_t
mbslen (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        /* Something is wrong.  */
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}
This function simply calls mbrlen for each multibyte character in the string and counts the number of function calls. Please note that we here use MB_LEN_MAX as the size argument in the mbrlen call. This is OK since a) this value is at least as large as the length of the longest multibyte character sequence and b) we know that the string s ends with a NUL byte which cannot be part of any other multibyte character sequence but the one representing the NUL wide character. Therefore the mbrlen function will never read invalid memory.
Now that this function is available (just to make this clear, this function is not part of the GNU C library) we can compute the number of bytes required to store the wide character string converted from the multibyte character string s using
wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
Please note that the mbslen function is quite inefficient. An implementation of mbstouwcs using mbslen would have to perform the conversion of the multibyte character input string twice and this conversion might be quite expensive. So it is necessary to think about the consequences of using the easier but imprecise method before doing the work twice.
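The single-pass alternative mentioned earlier, allocating pessimistically and then shrinking the block, could be sketched as follows. The name mbstouwcs_shrunk is hypothetical, and the sketch assumes the caller releases the result with free.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical variant of mbstouwcs: convert s to an uppercase wide
   character string in one pass, then shrink the block to the size
   actually used.  Returns NULL on allocation failure or invalid input.  */
wchar_t *
mbstouwcs_shrunk (const char *s)
{
  size_t len = strlen (s) + 1;  /* Include the terminating NUL.  */
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t *shrunk;
  wchar_t wc;
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (&wc, s, len, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or incomplete input: do not leak the block.  */
          free (result);
          return NULL;
        }
      *wcp++ = towupper (wc);
      len -= nbytes;
      s += nbytes;
    }
  *wcp++ = L'\0';

  /* Give back the unused tail; keep the old block if that fails.  */
  shrunk = realloc (result, (wcp - result) * sizeof (wchar_t));
  return shrunk != NULL ? shrunk : result;
}
```

This converts the input only once, at the cost of a temporarily oversized allocation.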
size_t wcrtomb (char *restrict s, wchar_t wc, mbstate_t *restrict ps)
The wcrtomb function ("wide character restartable to multibyte") converts a single wide character into a multibyte string corresponding to that wide character.
If s is a null pointer the function resets the state stored in the object pointed to by ps (or the internal mbstate_t object) to the initial state. This can also be achieved by a call like this:
wcrtomb (temp_buf, L'\0', ps)
since if s is a null pointer wcrtomb performs as if it writes into an internal buffer which is guaranteed to be large enough.
If wc is the NUL wide character wcrtomb emits, if necessary, a shift sequence to get the state ps into the initial state, followed by a single NUL byte, which is stored in the string s.
Otherwise a byte sequence (possibly including shift sequences) is written into the string s. This only happens if wc is a valid wide character, i.e., it has a multibyte representation in the character set selected by the locale of the LC_CTYPE category. If wc is not a valid wide character nothing is stored in the string s, errno is set to EILSEQ, the conversion state in ps is undefined and the return value is (size_t) -1.
If no error occurred the function returns the number of bytes stored in the string s. This includes all bytes representing shift sequences.
One word about the interface of the function: there is no parameter specifying the length of the array s. Instead the function assumes that there are at least MB_CUR_MAX bytes available since this is the maximum length of any byte sequence representing a single character. So the caller has to make sure that there is enough space available, otherwise buffer overruns can occur.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
Using this function is as easy as using mbrtowc. The following example appends a wide character string to a multibyte character string. Again, the code is not really useful (nor correct); it is simply here to demonstrate the use and some problems.
char *
mbscatwcs (char *s, size_t len, const wchar_t *ws)
{
  mbstate_t state;
  /* Find the end of the existing string.  */
  char *wp = strchr (s, '\0');
  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    {
      size_t nbytes;
      if (len < MB_CUR_MAX)
        {
          /* We cannot guarantee that the next
             character fits into the buffer, so
             return an error.  */
          errno = E2BIG;
          return NULL;
        }
      nbytes = wcrtomb (wp, *ws, &state);
      if (nbytes == (size_t) -1)
        /* Error in the conversion.  */
        return NULL;
      len -= nbytes;
      wp += nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
First the function has to find the end of the string currently in the array s. The strchr call does this very efficiently since a requirement for multibyte character representations is that the NUL byte never is used except to represent itself (and in this context, the end of the string).
After initializing the state object the loop is entered where the first task is to make sure there is enough room in the array s. We abort if there are not at least MB_CUR_MAX bytes available. This is not always optimal but we have no other choice. We might have less than MB_CUR_MAX bytes available but the next multibyte character might also be only one byte long. At the time the wcrtomb call returns it is too late to decide whether the buffer was large enough or not. If this solution is really unsuitable there is a very slow but more accurate solution.
...
if (len < MB_CUR_MAX)
  {
    mbstate_t temp_state;
    char temp_buf[MB_LEN_MAX];
    memcpy (&temp_state, &state, sizeof (state));
    if (wcrtomb (temp_buf, *ws, &temp_state) > len)
      {
        /* We cannot guarantee that the next
           character fits into the buffer, so
           return an error.  */
        errno = E2BIG;
        return NULL;
      }
  }
...
Here we perform the conversion into a scratch buffer, where it cannot overflow the real one, so that afterwards we are in a position to make an exact decision about the buffer size. Please note that the trial conversion needs a real destination of at least MB_CUR_MAX bytes (temp_buf is sized with MB_LEN_MAX). A null pointer would not do here: as described above, with a null destination wcrtomb only resets the state, as if it were converting L'\0', so it cannot report the length of *ws. The most unusual thing about this piece of code certainly is the duplication of the conversion state object. But think about this: if a change of the state is necessary to emit the next multibyte character we want to have the same shift state change performed in the real conversion. Therefore we have to preserve the initial shift state information.
There are certainly many more and even better solutions to this problem. This example is only meant for educational purposes.
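One such alternative is to convert each character into a small scratch buffer first and copy it only if it fits. The sketch below assumes a hypothetical helper append_wc with made-up return codes; it relies only on the documented guarantee that wcrtomb stores at most MB_CUR_MAX bytes, which never exceeds MB_LEN_MAX.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: append the multibyte form of wc to dst, which
   has room for cap bytes of which *used are already taken (*used must
   not exceed cap).  Returns 0 on success, -1 if wc cannot be
   converted, and -2 if the result does not fit.  On failure dst,
   *used and *ps are left unchanged.  */
int
append_wc (char *dst, size_t *used, size_t cap, wchar_t wc, mbstate_t *ps)
{
  char scratch[MB_LEN_MAX];
  mbstate_t trial = *ps;        /* Work on a copy of the state.  */
  size_t nbytes = wcrtomb (scratch, wc, &trial);

  if (nbytes == (size_t) -1)
    return -1;
  if (nbytes > cap - *used)
    return -2;

  memcpy (dst + *used, scratch, nbytes);
  *used += nbytes;
  *ps = trial;                  /* Commit the state change.  */
  return 0;
}
```

The conversion is performed only once per character, and the caller's state is committed only when the bytes were actually stored.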
The functions described in the previous section only convert a single character at a time. Most operations to be performed in real-world programs include strings and therefore the ISO C standard also defines conversions on entire strings. However, the defined set of functions is quite limited, thus the GNU C library contains a few extensions which can help in some important situations.
size_t mbsrtowcs (wchar_t *restrict dst, const char **restrict src, size_t len, mbstate_t *restrict ps)
The mbsrtowcs function ("multibyte string restartable to wide character string") converts a NUL-terminated multibyte character string at *src into an equivalent wide character string, including the NUL wide character at the end. The conversion is started using the state information from the object pointed to by ps or from an internal object of mbsrtowcs if ps is a null pointer. Before returning, the state object is updated to match the state after the last converted character. The state is the initial state if the terminating NUL byte is reached and converted.
If dst is not a null pointer the result is stored in the array pointed to by dst, otherwise the conversion result is not available since it is stored in an internal buffer.
If len wide characters are stored in the array dst before reaching the end of the input string the conversion stops and len is returned. If dst is a null pointer len is never checked.
Another reason for a premature return from the function call is if the input string contains an invalid multibyte sequence. In this case the global variable errno is set to EILSEQ and the function returns (size_t) -1.
In all other cases the function returns the number of wide characters converted during this call. If dst is not null mbsrtowcs stores in the pointer pointed to by src a null pointer (if the NUL byte in the input string was reached) or the address of the byte following the last converted multibyte character.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
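Because len is not checked when dst is a null pointer, a first call can be used just to count the wide characters, and a second call can then fill a buffer of exactly the right size. The following is a sketch of that pattern; the name mbs_to_wcs_alloc is an assumption for this illustration, and the caller is expected to free the result.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide character copy
   of the NUL-terminated multibyte string s, or NULL on invalid input
   or allocation failure.  */
wchar_t *
mbs_to_wcs_alloc (const char *s)
{
  mbstate_t state;
  const char *src = s;
  size_t nchars;
  wchar_t *result;

  /* First pass: count the wide characters, not including the NUL.  */
  memset (&state, '\0', sizeof (state));
  nchars = mbsrtowcs (NULL, &src, 0, &state);
  if (nchars == (size_t) -1)
    return NULL;

  result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert for real, starting from the initial state.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, nchars + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

As with mbslen, the input is converted twice, so this trades speed for an exact allocation.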
The definition of this function has one limitation which has to be understood. The requirement that the string at *src has to be NUL terminated causes problems if one wants to convert buffers with text. A buffer is normally not a collection of NUL-terminated strings but instead a continuous collection of lines, separated by newline characters. Now assume a function to convert one line from a buffer is needed. Since the line is not NUL terminated the source pointer cannot directly point into the unmodified text buffer. This means that either one inserts the NUL byte at the appropriate place for the time of the mbsrtowcs function call (which is not doable for a read-only buffer or in a multi-threaded application) or one copies the line into an extra buffer where it can be terminated by a NUL byte. Note that it is not in general possible to limit the number of characters to convert by setting the parameter len to any specific value. Since it is not known how many bytes each multibyte character sequence is in length one can only guess.
There is still a problem with the method of NUL-terminating a line right after the newline character which could lead to very strange results. As said in the description of the mbsrtowcs function above the conversion state is guaranteed to be in the initial shift state after processing the NUL byte at the end of the input string. But this NUL byte is not really part of the text. That is, the conversion state after the newline in the original text could be something different than the initial shift state and therefore the first character of the next line is encoded using this state. But the state in question is never accessible to the user since the conversion stops after the NUL byte (which resets the state). Most stateful character sets in use today require that the shift state after a newline is the initial state, but this is not a strict guarantee. Therefore simply NUL terminating a piece of a running text is not always an adequate solution and should never be used in general-purpose code.
The generic conversion interface (the section called “Generic Charset Conversion”) does not have this limitation (it simply works on buffers, not strings), and the GNU C library contains a set of functions which take additional parameters specifying the maximal number of bytes which are consumed from the input string. This way the problem of mbsrtowcs's example above could be solved by determining the line length and passing this length to the function.
size_t wcsrtombs (char *restrict dst, const wchar_t **restrict src, size_t len, mbstate_t *restrict ps)
The wcsrtombs function ("wide character string restartable to multibyte string") converts the NUL-terminated wide character string at *src into an equivalent multibyte character string and stores the result in the array pointed to by dst. The NUL wide character is also converted. The conversion starts in the state described in the object pointed to by ps or by a state object local to wcsrtombs in case ps is a null pointer. If dst is a null pointer the conversion is performed as usual but the result is not available. If all characters of the input string were successfully converted and if dst is not a null pointer the pointer pointed to by src gets assigned a null pointer.
If one of the wide characters in the input string has no valid multibyte character equivalent the conversion stops early, sets the global variable errno to EILSEQ, and returns (size_t) -1.
Another reason for a premature stop is if dst is not a null pointer and the next converted character would require more than len bytes in total in the array dst. In this case the pointer pointed to by src is assigned a value pointing to the wide character right after the last one successfully converted.
Except in the case of an encoding error the return value of the function is the number of bytes in all the multibyte character sequences stored in dst. Before returning the state in the object pointed to by ps (or the internal object in case ps is a null pointer) is updated to reflect the state after the last conversion. The state is the initial shift state in case the terminating NUL wide character was converted.
This function was introduced in Amendment 1 to ISO C90 and is declared in wchar.h.
The restriction mentioned above for the mbsrtowcs function also applies here. There is no possibility to directly control the number of input characters. One has to place the NUL wide character at the correct place or control the consumed input indirectly via the available output array size (the len parameter).
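Controlling the consumed input through the output size can be sketched as follows. The helper wcs_to_mbs_bounded and its out-parameter are assumptions made for this illustration; it reports how many wide characters were consumed by inspecting the updated source pointer. Note that with a stateful encoding a truncated result is not necessarily back in the initial shift state.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert as much of ws as fits into dst, which
   has room for cap bytes, and always NUL-terminate the result.  The
   number of wide characters consumed is stored in *consumed.  Returns
   the number of bytes stored, not counting the NUL byte, or
   (size_t) -1 on error.  */
size_t
wcs_to_mbs_bounded (char *dst, size_t cap, const wchar_t *ws,
                    size_t *consumed)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t nbytes;

  if (cap == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Reserve one byte for the terminating NUL.  */
  nbytes = wcsrtombs (dst, &src, cap - 1, &state);
  if (nbytes == (size_t) -1)
    return nbytes;
  dst[nbytes] = '\0';

  /* src is a null pointer if the whole string was converted.  */
  *consumed = src == NULL ? wcslen (ws) : (size_t) (src - ws);
  return nbytes;
}
```

A caller can resume at ws + *consumed if the output was truncated.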
size_t mbsnrtowcs (wchar_t *restrict dst, const char **restrict src, size_t nmc, size_t len, mbstate_t *restrict ps)
The mbsnrtowcs function is very similar to the mbsrtowcs function. All the parameters are the same except for nmc which is new. The return value is the same as for mbsrtowcs.
This new parameter specifies how many bytes at most can be used from the multibyte character string. That is, the multibyte character string *src need not be NUL terminated. But if a NUL byte is found within the first nmc bytes of the string the conversion stops there.
This function is a GNU extension. It is meant to work around the problems mentioned above. Now it is possible to convert a buffer with multibyte character text piece by piece without having to care about inserting NUL bytes and the effect of NUL bytes on the conversion state.
A function to convert a multibyte string into a wide character string and display it could be written like this (this is not a really useful example):
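void
showmbs (const char *src, FILE *fp)
{
  mbstate_t state;
  int cnt = 0;

  memset (&state, '\0', sizeof (state));
  while (1)
    {
      wchar_t linebuf[100];
      const char *endp = strchr (src, '\n');
      size_t n;

      /* Exit if there is no more line.  */
      if (endp == NULL)
        break;

      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
      if (n == (size_t) -1)
        /* Invalid multibyte sequence.  */
        break;
      linebuf[n] = L'\0';
      fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);

      /* Continue after the newline; anything beyond the first
         99 wide characters of a long line is skipped.  */
      src = endp + 1;
    }
}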
There is no problem with the state after a call to mbsnrtowcs. Since we don't insert characters in the strings which were not in there right from the beginning and we use state only for the conversion of the given buffer there is no problem with altering the state.
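Stripped of the line handling, the core of this technique is converting a slice of a buffer that has no terminator. A minimal sketch follows; the helper name convert_slice is hypothetical, and the feature-test macro at the top is an assumption about what the headers need in order to declare mbsnrtowcs.

```c
#define _POSIX_C_SOURCE 200809L   /* Assumed to expose mbsnrtowcs.  */
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most nbytes bytes starting at buf,
   which need not be NUL terminated, into dst, which has room for
   dstlen wide characters including the terminator.  The conversion
   state *ps is carried over between calls.  Returns the number of
   wide characters stored, not counting the terminator, or
   (size_t) -1 on error.  */
size_t
convert_slice (wchar_t *dst, size_t dstlen, const char *buf,
               size_t nbytes, mbstate_t *ps)
{
  const char *src = buf;
  size_t n;

  if (dstlen == 0)
    return (size_t) -1;

  n = mbsnrtowcs (dst, &src, nbytes, dstlen - 1, ps);
  if (n == (size_t) -1)
    return n;
  dst[n] = L'\0';
  return n;
}
```

The input buffer is never modified, so the same approach works for read-only or shared buffers.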
size_t wcsnrtombs (char *restrict dst, const wchar_t **restrict src, size_t nwc, size_t len, mbstate_t *restrict ps)
The wcsnrtombs function implements the conversion from wide character strings to multibyte character strings. It is similar to wcsrtombs but it takes, just like mbsnrtowcs, an extra parameter which specifies the length of the input string.
No more than nwc wide characters from the input string *src are converted. If the input string contains a NUL wide character in the first nwc characters the conversion stops at this place.
This function is a GNU extension and just like mbsnrtowcs it helps in situations where no NUL-terminated input strings are available.
The example programs given in the last sections are only brief and do not contain all the error checking etc. Presented here is a complete and documented example. It features the mbrtowc function but it should be easy to derive versions using the other functions.
int
file_mbsrtowcs (int input, int output)
{
  /* Note the use of MB_LEN_MAX.
     MB_CUR_MAX cannot portably be used here.  */
  char buffer[BUFSIZ + MB_LEN_MAX];
  mbstate_t state;
  int filled = 0;
  int eof = 0;

  /* Initialize the state.  */
  memset (&state, '\0', sizeof (state));

  while (!eof)
    {
      ssize_t nread;
      ssize_t nwrite;
      char *inp = buffer;
      /* Every input byte yields at most one wide character.  */
      wchar_t outbuf[BUFSIZ + MB_LEN_MAX];
      wchar_t *outp = outbuf;
      int invalid = 0;

      /* Fill up the buffer from the input file.  */
      nread = read (input, buffer + filled, BUFSIZ);
      if (nread < 0)
        {
          perror ("read");
          return 0;
        }
      /* If we reach end of file, make a note to read no more.  */
      if (nread == 0)
        eof = 1;

      /* filled is now the number of bytes in buffer.  */
      filled += nread;

      /* Convert those bytes to wide characters--as many as we can.  */
      while (filled > 0)
        {
          /* Remember the state so that an incomplete character can
             be converted again once more input is available.  */
          mbstate_t saved = state;
          size_t thislen = mbrtowc (outp, inp, filled, &state);
          /* Stop converting at an invalid character.  */
          if (thislen == (size_t) -1)
            {
              invalid = 1;
              break;
            }
          /* Stop as well if we have read just the first part
             of a valid character.  */
          if (thislen == (size_t) -2)
            {
              state = saved;
              break;
            }
          /* We want to handle embedded NUL bytes
             but the return value is 0.  Correct this.  */
          if (thislen == 0)
            thislen = 1;
          /* Advance past this character.  */
          inp += thislen;
          filled -= thislen;
          ++outp;
        }

      /* Write the wide characters we just made.  */
      nwrite = write (output, outbuf,
                      (outp - outbuf) * sizeof (wchar_t));
      if (nwrite < 0)
        {
          perror ("write");
          return 0;
        }

      /* See if we have a real invalid character.  */
      if (invalid || (eof && filled > 0) || filled >= MB_CUR_MAX)
        {
          error (0, 0, "invalid multibyte character");
          return 0;
        }

      /* If any characters must be carried forward,
         put them at the beginning of buffer.  */
      if (filled > 0)
        memmove (buffer, inp, filled);
    }

  return 1;
}