Non-reentrant Conversion Function

The functions described in the previous chapter are defined in Amendment 1 to ISO C90. The original ISO C90 standard also contained functions for character set conversion. They were not described first because they are almost entirely useless.

The problem is that all the functions for conversion defined in ISO C90 use a local state. This implies that multiple conversions at the same time (not only when using threads) cannot be done, and that you cannot first convert single characters and then strings since you cannot tell the conversion functions which state to use.

These functions are therefore usable only in a very limited set of situations. You must finish converting an entire string before starting a new one, and each string or text must be converted with the same function. (The library itself is not a source of trouble here; it is guaranteed that no library function changes the state of any of these functions.) For these reasons it is strongly recommended that you use the functions from the previous section in place of the non-reentrant conversion functions.
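To make the contrast concrete, here is a minimal sketch of the restartable interface recommended above, converting two strings in an interleaved fashion. Each conversion carries its own mbstate_t object, which is exactly what the non-reentrant functions cannot offer. The names next_wide and same_wide_text are hypothetical helpers invented for this illustration.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the next multibyte character at *P using
   the caller's own state object, advance *P, and return the wide
   character.  Returns L'\0' at the end of the string or on error.  */
static wchar_t
next_wide (const char **p, mbstate_t *state)
{
  wchar_t wc;
  /* Pass strlen + 1 so that the terminating null byte is visible.  */
  size_t n = mbrtowc (&wc, *p, strlen (*p) + 1, state);
  if (n == 0 || n == (size_t) -1 || n == (size_t) -2)
    return L'\0';
  *p += n;
  return wc;
}

/* Interleave the conversion of two strings.  Each string has its own
   state, so neither conversion disturbs the other.  Returns 1 if both
   strings yield the same wide character sequence, else 0.  */
int
same_wide_text (const char *a, const char *b)
{
  mbstate_t sa, sb;
  memset (&sa, 0, sizeof sa);   /* An all-zero object is the initial state.  */
  memset (&sb, 0, sizeof sb);

  for (;;)
    {
      wchar_t wa = next_wide (&a, &sa);
      wchar_t wb = next_wide (&b, &sb);
      if (wa != wb)
        return 0;
      if (wa == L'\0')
        return 1;
    }
}
```

Writing the same loop with mbtowc would not be safe for an encoding with shift states, because both strings would share the single hidden state.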

Non-reentrant Conversion of Single Characters

int mbtowc (wchar_t *restrict result, const char *restrict string, size_t size)

The mbtowc ("multibyte to wide character") function when called with non-null string converts the first multibyte character beginning at string to its corresponding wide character code. It stores the result in *result.

mbtowc never examines more than size bytes. (The idea is to supply for size the number of bytes of data you have in hand.)

mbtowc with non-null string distinguishes three possibilities: the first size bytes at string start with a valid multibyte character, they start with an invalid byte sequence or just part of a character, or string points to an empty string (a null character).

For a valid multibyte character, mbtowc converts it to a wide character and stores that in *result, and returns the number of bytes in that character (always at least 1, and never more than size).

For an invalid byte sequence, mbtowc returns -1. For an empty string, it returns 0, also storing the wide null character L'\0' in *result.

If the multibyte character code uses shift characters, then mbtowc maintains and updates a shift state as it scans. If you call mbtowc with a null pointer for string, that initializes the shift state to its standard initial value. It also returns nonzero if the multibyte character code in use actually has a shift state. See the section called “States in Non-reentrant Functions”.
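The return value conventions described above can be put together into a short scanning loop. The following is only a sketch: count_mb_chars is a hypothetical name, and the function assumes that nothing else is using the hidden state of mbtowc at the same time.

```c
#include <stdlib.h>
#include <string.h>

/* Count the multibyte characters in S using mbtowc.
   Returns -1 if an invalid or incomplete sequence is found.  */
long
count_mb_chars (const char *s)
{
  size_t left = strlen (s);
  long count = 0;

  /* Reset the hidden shift state before scanning.  */
  mbtowc (NULL, NULL, 0);

  while (left > 0)
    {
      wchar_t wc;
      int n = mbtowc (&wc, s, left);
      if (n < 0)
        {
          /* Reset again so that later calls start from a clean state.  */
          mbtowc (NULL, NULL, 0);
          return -1;
        }
      if (n == 0)
        break;                  /* Null character: end of string.  */
      s += n;
      left -= (size_t) n;
      count++;
    }
  return count;
}
```

Passing the number of bytes still available as the size argument keeps mbtowc from reading past the end of the data.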

int wctomb (char *string, wchar_t wchar)

The wctomb ("wide character to multibyte") function converts the wide character code wchar to its corresponding multibyte character sequence, and stores the result in bytes starting at string. At most MB_CUR_MAX bytes are stored.

wctomb with non-null string distinguishes three possibilities for wchar: a valid wide character code (one that can be translated to a multibyte character), an invalid code, and L'\0'.

Given a valid code, wctomb converts it to a multibyte character, storing the bytes starting at string. Then it returns the number of bytes in that character (always at least 1, and never more than MB_CUR_MAX).

If wchar is an invalid wide character code, wctomb returns -1. If wchar is L'\0', it stores the null byte '\0' in *string, preceded by any shift sequence needed to return to the initial shift state, and returns the number of bytes stored (1 when no shift sequence is needed).

If the multibyte character code uses shift characters, then wctomb maintains and updates a shift state as it scans. If you call wctomb with a null pointer for string, that initializes the shift state to its standard initial value. It also returns nonzero if the multibyte character code in use actually has a shift state. See the section called “States in Non-reentrant Functions”.

Calling this function with a wchar argument of zero when string is not null has the side effect of reinitializing the stored shift state, in addition to storing the multibyte character '\0' and returning its length.
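As an illustration of wctomb, the following sketch converts a wide string into a caller-supplied byte buffer one character at a time. The name encode_wide and its error convention are hypothetical. The sketch appends a plain null byte itself, so it does not emit the shift sequence that a stateful encoding might need at the end of the string.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Convert the wide string WS into OUT, which holds OUTSIZE bytes, one
   character at a time with wctomb.  Returns the number of bytes
   written, not counting the terminating null byte, or -1 if a
   character cannot be converted or the buffer is too small.  */
long
encode_wide (char *out, size_t outsize, const wchar_t *ws)
{
  char tmp[MB_LEN_MAX];         /* MB_CUR_MAX never exceeds MB_LEN_MAX.  */
  size_t used = 0;

  /* Reset the hidden shift state before converting.  */
  wctomb (NULL, L'\0');

  for (; *ws != L'\0'; ws++)
    {
      int n = wctomb (tmp, *ws);
      /* Always keep one byte free for the terminating null byte.  */
      if (n < 0 || used + (size_t) n >= outsize)
        return -1;
      memcpy (out + used, tmp, (size_t) n);
      used += (size_t) n;
    }
  if (used >= outsize)
    return -1;
  out[used] = '\0';
  return (long) used;
}
```

Converting into a temporary array of MB_LEN_MAX bytes first means wctomb can never overrun the destination, whatever the current locale.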

Similar to mbrlen, there is also a non-reentrant function that computes the length of a multibyte character. It can be defined in terms of mbtowc.

int mblen (const char *string, size_t size)

The mblen function with a non-null string argument returns the number of bytes that make up the multibyte character beginning at string, never examining more than size bytes. (The idea is to supply for size the number of bytes of data you have in hand.)

The return value of mblen distinguishes three possibilities: the first size bytes at string start with a valid multibyte character, they start with an invalid byte sequence or just part of a character, or string points to an empty string (a null character).

For a valid multibyte character, mblen returns the number of bytes in that character (always at least 1, and never more than size). For an invalid byte sequence, mblen returns -1. For an empty string, it returns 0.

If the multibyte character code uses shift characters, then mblen maintains and updates a shift state as it scans. If you call mblen with a null pointer for string, that initializes the shift state to its standard initial value. It also returns a nonzero value if the multibyte character code in use actually has a shift state. See the section called “States in Non-reentrant Functions”.

The function mblen is declared in stdlib.h.
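The remark that mblen can be defined in terms of mbtowc can be sketched as follows. The name my_mblen is hypothetical. Note one difference from the real function: ISO C specifies that mblen does not affect the conversion state of mbtowc, whereas this stand-in uses that state directly.

```c
#include <stdlib.h>

/* Hypothetical stand-in for mblen, written in terms of mbtowc.
   A null result pointer tells mbtowc to discard the wide character
   and only report how many bytes the multibyte character occupies.  */
int
my_mblen (const char *string, size_t size)
{
  return mbtowc (NULL, string, size);
}
```

For an encoding without shift states the two functions should give the same answers for the same arguments.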

Non-reentrant Conversion of Strings

For convenience, the ISO C90 standard also defines functions that convert entire strings instead of single characters. These functions suffer from the same problems as their reentrant counterparts from Amendment 1 to ISO C90; see the section called “Converting Multibyte and Wide Character Strings”.

size_t mbstowcs (wchar_t *wstring, const char *string, size_t size)

The mbstowcs ("multibyte string to wide character string") function converts the null-terminated string of multibyte characters string to an array of wide character codes, storing not more than size wide characters into the array beginning at wstring. The terminating null character counts towards the size, so if size is less than the actual number of wide characters resulting from string, no terminating null character is stored.

The conversion of characters from string begins in the initial shift state.

If an invalid multibyte character sequence is found, this function returns (size_t) -1. Otherwise, it returns the number of wide characters stored in the array wstring. This number does not include the terminating null character, which is present if the number is less than size.

Here is an example showing how to convert a string of multibyte characters, allocating enough space for the result.

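#include <stdlib.h>
#include <string.h>

/* xmalloc and xrealloc are assumed to be the usual wrappers around
   malloc and realloc that do not return on allocation failure.  */
wchar_t *
mbstowcs_alloc (const char *string)
{
  /* Every wide character consumes at least one byte of input, so
     strlen (string) + 1 elements are always enough.  */
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));

  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    {
      /* Invalid multibyte sequence: do not leak the buffer.  */
      free (buf);
      return NULL;
    }
  /* Shrink the buffer to the converted length plus the terminator.  */
  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
  return buf;
}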

size_t wcstombs (char *string, const wchar_t *wstring, size_t size)

The wcstombs ("wide character string to multibyte string") function converts the null-terminated wide character array wstring into a string containing multibyte characters, storing not more than size bytes starting at string, followed by a terminating null character if there is room. The conversion of characters begins in the initial shift state.

The terminating null character counts towards the size, so if size is less than or equal to the number of bytes needed to convert wstring, no terminating null character is stored.

If a code that does not correspond to a valid multibyte character is found, this function returns (size_t) -1. Otherwise, the return value is the number of bytes stored in the array string. This number does not include the terminating null character, which is present if the number is less than size.
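By analogy with mbstowcs_alloc, the reverse conversion might be sketched as follows. The name wcstombs_alloc is hypothetical, and the sketch uses plain malloc rather than a checking wrapper. It relies on the assumption that no wide character needs more than MB_CUR_MAX bytes in the current locale, so the buffer is sized for the worst case.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical counterpart of mbstowcs_alloc.  Returns a newly
   allocated multibyte string, or NULL on allocation failure or if
   WSTRING contains a code with no multibyte representation.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  /* Worst case: MB_CUR_MAX bytes per wide character, plus one byte
     for the terminating null character.  */
  size_t size = wcslen (wstring) * MB_CUR_MAX + 1;
  char *buf = malloc (size);
  if (buf == NULL)
    return NULL;

  size_t used = wcstombs (buf, wstring, size);
  if (used == (size_t) -1)
    {
      /* Unconvertible wide character: release the buffer.  */
      free (buf);
      return NULL;
    }
  /* used is less than size here, so the terminator was stored.  */
  return buf;
}
```

The caller owns the returned buffer and must release it with free.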

States in Non-reentrant Functions

In some multibyte character codes, the meaning of any particular byte sequence is not fixed; it depends on what other sequences have come earlier in the same string. Typically there are just a few sequences that can change the meaning of other sequences; these few are called shift sequences and we say that they set the shift state for other sequences that follow.

To illustrate shift state and shift sequences, suppose we decide that the sequence 0200 (just one byte) enters Japanese mode, in which pairs of bytes in the range from 0240 to 0377 are single characters, while 0201 enters Latin-1 mode, in which single bytes in the range from 0240 to 0377 are characters, and interpreted according to the ISO Latin-1 character set. This is a multibyte code which has two alternative shift states ("Japanese mode" and "Latin-1 mode"), and two shift sequences that specify particular shift states.
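To see why such a code forces the converter to remember state, here is a toy scanner for the made-up encoding just described. It is only a sketch: count_toy_chars is a hypothetical name, the scanner assumes that a string starts in Latin-1 mode, and it treats every byte other than the two shift sequences and the Japanese pairs as a single-byte character. None of those details are specified by the example above.

```c
#include <stddef.h>

enum shift_mode { MODE_LATIN1, MODE_JAPANESE };

/* Count the characters in the LEN bytes at BUF under the made-up code
   from the text.  Returns -1 if a two-byte character is cut short or
   has an invalid second byte.  */
long
count_toy_chars (const unsigned char *buf, size_t len)
{
  enum shift_mode mode = MODE_LATIN1;   /* Assumed initial shift state.  */
  long count = 0;
  size_t i = 0;

  while (i < len)
    {
      unsigned char b = buf[i];
      if (b == 0200)
        {
          /* Shift sequence: enter Japanese mode; not a character.  */
          mode = MODE_JAPANESE;
          i++;
        }
      else if (b == 0201)
        {
          /* Shift sequence: enter Latin-1 mode; not a character.  */
          mode = MODE_LATIN1;
          i++;
        }
      else if (mode == MODE_JAPANESE && b >= 0240)
        {
          /* A pair of bytes in 0240..0377 forms one character.  */
          if (i + 1 >= len || buf[i + 1] < 0240)
            return -1;
          i += 2;
          count++;
        }
      else
        {
          /* Any other byte is a single-byte character.  */
          i++;
          count++;
        }
    }
  return count;
}
```

The same two bytes 0241 0242 count as one character after 0200 and as two characters after 0201, which is why the meaning of a byte sequence cannot be determined without knowing what came before it.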

When the multibyte character code in use has shift states, then mblen, mbtowc and wctomb must maintain and update the current shift state as they scan the string. To make this work properly, you must follow these rules:

  • Before starting to scan a string, call the function with a null pointer for the multibyte character address--for example, mblen (NULL, 0). This initializes the shift state to its standard initial value.

  • Scan the string one character at a time, in order. Do not "back up" and rescan characters already scanned, and do not intersperse the processing of different strings.

Here is an example of using mblen following these rules:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void
scan_string (const char *s)
{
  size_t length = strlen (s);

  /* Initialize shift state.  */
  mblen (NULL, 0);

  while (1)
    {
      int thischar = mblen (s, length);
      /* Deal with end of string and invalid characters.  */
      if (thischar == 0)
        break;
      if (thischar == -1)
        {
          fprintf (stderr, "invalid multibyte character\n");
          break;
        }
      /* Advance past this character.  */
      s += thischar;
      length -= (size_t) thischar;
    }
}

The functions mblen, mbtowc and wctomb are not reentrant when using a multibyte code that uses a shift state. However, no other library functions call these functions, so you don't have to worry that the shift state will be changed mysteriously.