Generic Charset Conversion

The conversion functions mentioned so far in this chapter all have in common that they operate on character sets which are not directly specified by the functions. The multibyte encoding used is specified by the currently selected locale for the LC_CTYPE category. The wide character set is fixed by the implementation (in the case of the GNU C library it is always UCS-4 encoded ISO 10646).

This has of course several problems when it comes to general character conversion:

The XPG2 standard defines a completely new set of functions which has none of these limitations. They are not at all coupled to the selected locales and they put no constraints on the character sets selected for source and destination. Only the set of available conversions limits them. The standard does not specify that any conversion at all must be available. It is a measure of the quality of the implementation.

In the following text the interface to iconv, the conversion function, is described first. Comparisons with other implementations will show what pitfalls lie in the way of portable applications. Finally, the implementation is described as far as it is of interest to the advanced user who wants to extend the conversion capabilities.

Generic Character Set Conversion Interface

This set of functions follows the traditional cycle of using a resource: open-use-close. The interface consists of three functions, each of which implements one step.

Before the interfaces are described it is necessary to introduce a data type. Just like other open-use-close interfaces, the functions introduced here work using handles, and the iconv.h header defines a special type for the handles used.

iconv_t

This data type is an abstract type defined in iconv.h. The user must not assume anything about the definition of this type; it must be completely opaque.

Objects of this type can get assigned handles for the conversions using the iconv functions. The objects themselves need not be freed, but the conversions the handles stand for have to be.

The first step is the function to create a handle.

iconv_t iconv_open (const char *tocode, const char *fromcode)

The iconv_open function has to be used before starting a conversion. The two parameters this function takes determine the source and destination character set for the conversion, and if the implementation is able to perform such a conversion the function returns a handle.

If the wanted conversion is not available the function returns (iconv_t) -1. In this case the global variable errno can have the following values:

EMFILE

The process already has OPEN_MAX file descriptors open.

ENFILE

The system limit of open files is reached.

ENOMEM

Not enough memory to carry out the operation.

EINVAL

The conversion from fromcode to tocode is not supported.

It is not possible to use the same descriptor in different threads to perform independent conversions. Within the data structures associated with the descriptor there is information about the conversion state. This must not be messed up by using it in different conversions.

An iconv descriptor is like a file descriptor in that a new descriptor must be created for every use. The descriptor does not stand for all of the conversions from fromcode to tocode.

The GNU C library implementation of iconv_open has one significant extension to other implementations. To ease the extension of the set of available conversions the implementation allows storing the necessary files with data and code in arbitrarily many directories. How this extension has to be written will be explained below (the section called “The iconv Implementation in the GNU C library”). Here it is only important to say that all directories mentioned in the GCONV_PATH environment variable are considered if they contain a file gconv-modules. These directories need not necessarily be created by the system administrator. In fact, this extension is introduced to help users writing and using their own, new conversions. Of course this does not work for security reasons in SUID binaries; in this case only the system directory is considered and this normally is prefix/lib/gconv. The GCONV_PATH environment variable is examined exactly once at the first call of the iconv_open function. Later modifications of the variable have no effect.

This function got introduced early in the X/Open Portability Guide, version 2. It is supported by all commercial Unices as it is required for the Unix branding. However, the quality and completeness of the implementation varies widely. The function is declared in iconv.h.

The iconv implementation can associate large data structures with the handle returned by iconv_open. Therefore it is crucial to free all the resources once all conversions are carried out and the conversion is not needed anymore.

int iconv_close (iconv_t cd)

The iconv_close function frees all resources associated with the handle cd which must have been returned by a successful call to the iconv_open function.

If the function call was successful the return value is 0. Otherwise it is -1 and errno is set appropriately. Defined errors are:

EBADF

The conversion descriptor is invalid.

This function was introduced together with the rest of the iconv functions in XPG2 and it is declared in iconv.h.
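The open and close steps described so far can be sketched as follows. This is only a minimal illustration: the helper name conversion_available is hypothetical, and the character set names used when calling it are assumed to be known to the implementation.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to
   TOCODE can be opened, 0 otherwise.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    {
      if (errno == EINVAL)
        fprintf (stderr, "conversion from '%s' to '%s' not available\n",
                 fromcode, tocode);
      else
        perror ("iconv_open");
      return 0;
    }

  /* The handle is not needed any longer; release its resources.  */
  if (iconv_close (cd) != 0)
    perror ("iconv_close");

  return 1;
}
```

A call such as conversion_available ("UTF-8", "ASCII") would be expected to succeed on most systems, while an unknown character set name makes iconv_open fail with EINVAL.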

The standard defines only one actual conversion function. This has therefore the most general interface: it allows conversion from one buffer to another. Conversion from a file to a buffer, vice versa, or even file to file can be implemented on top of it.

size_t iconv (iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft)

The iconv function converts the text in the input buffer according to the rules associated with the descriptor cd and stores the result in the output buffer. It is possible to call the function for the same text several times in a row since for stateful character sets the necessary state information is kept in the data structures associated with the descriptor.

The input buffer is specified by *inbuf and it contains *inbytesleft bytes. The extra indirection is necessary for communicating the used input back to the caller (see below). It is important to note that the buffer pointer is of type char and the length is measured in bytes even if the input text is encoded in wide characters.

The output buffer is specified in a similar way. *outbuf points to the beginning of the buffer with at least *outbytesleft bytes room for the result. The buffer pointer again is of type char and the length is measured in bytes. If outbuf or *outbuf is a null pointer the conversion is performed but no output is available.

If inbuf is a null pointer the iconv function performs the necessary action to put the state of the conversion into the initial state. This is obviously a no-op for non-stateful encodings, but if the encoding has a state such a function call might put some byte sequences in the output buffer which perform the necessary state changes. The next call with inbuf not being a null pointer then simply goes on from the initial state. It is important that the programmer never makes any assumption on whether the conversion has to deal with states or not. Even if the input and output character sets are not stateful the implementation might still have to keep states. This is due to the implementation chosen for the GNU C library as it is described below. Therefore an iconv call to reset the state should always be performed if some protocol requires this for the output text.
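A reset call of this kind might look like the sketch below. The helper name reset_conversion is hypothetical; the only library call used is iconv itself, with a null inbuf as described above.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: bring CD back to its initial state.  Any bytes
   needed to return the output to the initial shift state are stored
   at *OUTBUF, and *OUTBUF and *OUTBYTESLEFT are updated accordingly.
   Returns 0 on success, -1 on failure (errno is set by iconv).  */
int
reset_conversion (iconv_t cd, char **outbuf, size_t *outbytesleft)
{
  if (iconv (cd, NULL, NULL, outbuf, outbytesleft) == (size_t) -1)
    return -1;
  return 0;
}
```

Calling the helper unconditionally at the end of a text keeps the caller independent of whether the conversion involves states.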

The conversion stops for three reasons. The first is that all characters from the input buffer are converted. This actually can mean two things: really all bytes from the input buffer are consumed or there are some bytes at the end of the buffer which possibly can form a complete character but the input is incomplete. The second reason for a stop is when the output buffer is full. And the third reason is that the input contains invalid characters.

In all these cases the buffer pointers after the last successful conversion, for the input and output buffer, are stored in *inbuf and *outbuf, and the available room in each buffer is stored in *inbytesleft and *outbytesleft.
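The way the pointers and counters are updated can be seen in a small sketch. The helper name utf8_to_ucs4le is hypothetical, and the character set names "UTF-8" and "UCS-4LE" are assumed to be available in the implementation at hand.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes of UTF-8 text at IN to
   UCS-4LE, storing at most OUTSIZE bytes at OUT.  Returns the number
   of bytes written, or (size_t) -1 on any failure.  */
size_t
utf8_to_ucs4le (const char *in, size_t inlen, char *out, size_t outsize)
{
  iconv_t cd = iconv_open ("UCS-4LE", "UTF-8");
  char *inptr = (char *) in;   /* iconv does not modify the input bytes */
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  size_t rc;

  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return (size_t) -1;

  /* iconv advanced outptr past the converted text, so the pointer
     difference is the amount of output produced.  */
  return (size_t) (outptr - out);
}
```

The same information is available from the counters: outsize minus the final value of outleft is the number of bytes written.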

Since the character sets selected in the iconv_open call can be almost arbitrary there can be situations where the input buffer contains valid characters which have no identical representation in the output character set. The behavior in this situation is undefined. The current behavior of the GNU C library in this situation is to return with an error immediately. This certainly is not the most desirable solution. Therefore future versions will provide better ones but they are not yet finished.

If all input from the input buffer is successfully converted and stored in the output buffer the function returns the number of non-reversible conversions performed. In all other cases the return value is (size_t) -1 and errno is set appropriately. In this case the value pointed to by inbytesleft is nonzero.

EILSEQ

The conversion stopped because of an invalid byte sequence in the input. After the call *inbuf points at the first byte of the invalid byte sequence.

E2BIG

The conversion stopped because it ran out of space in the output buffer.

EINVAL

The conversion stopped because of an incomplete byte sequence at the end of the input buffer.

EBADF

The cd argument is invalid.

This function was introduced in the XPG2 standard and is declared in the iconv.h header.
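The three stop conditions can be told apart by looking at errno after a failed call. The following sketch assumes that the "UTF-8" and "UCS-4LE" character sets are available; the helper name conversion_status is hypothetical.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes of UTF-8 at IN to UCS-4LE
   using an output buffer of OUTSIZE bytes (at most 64) and report how
   the call ended: 0 on success, otherwise the errno value set by
   iconv_open or iconv.  */
int
conversion_status (const char *in, size_t inlen, size_t outsize)
{
  char out[64];
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize < sizeof (out) ? outsize : sizeof (out);
  int status = 0;
  iconv_t cd = iconv_open ("UCS-4LE", "UTF-8");

  if (cd == (iconv_t) -1)
    return errno;

  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    status = errno;   /* save it before iconv_close can change it */

  iconv_close (cd);
  return status;
}
```

With this helper an output buffer that is too small is reported as E2BIG, a truncated multibyte sequence at the end of the input as EINVAL, and a byte that can never start a valid sequence as EILSEQ.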

The definition of the iconv function is quite good overall. It provides quite flexible functionality. The only problems lie in the boundary cases, which are incomplete byte sequences at the end of the input buffer and invalid input. A third problem, which is not really a design problem, is the way conversions are selected. The standard does not say anything about the legitimate names or about a minimal set of available conversions. We will see how this negatively impacts other implementations, as is demonstrated below.

A complete iconv example

The example below features a solution for a common problem. Given that one knows the internal encoding used by the system for wchar_t strings one often is in the position to read text from a file and store it in wide character buffers. One can do this using mbsrtowcs but then we run into the problems discussed above.

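#include <errno.h>
#include <error.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <wchar.h>

int
file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)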
{
  char inbuf[BUFSIZ];
  size_t insize = 0;
  char *wrptr = (char *) outbuf;
  int result = 0;
  iconv_t cd;

  cd = iconv_open ("WCHAR_T", charset);
  if (cd == (iconv_t) -1)
    {
      /* Something went wrong.  */
      if (errno == EINVAL)
        error (0, 0, "conversion from '%s' to wchar_t not available",
               charset);
      else
        perror ("iconv_open");

      /* Terminate the output string.  */
      *outbuf = L'\0';

      return -1;
    }

  while (avail > 0)
    {
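      ssize_t nread;
      size_t nconv;
      char *inptr = inbuf;

      /* Read more input.  */
      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
      if (nread < 0)
        {
          /* The read itself failed; report it and stop.  */
          perror ("read");
          result = -1;
          break;
        }
      if (nread == 0)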
        {
          /* When we come here the file is completely read.
             This still could mean there are some unused
             characters in the inbuf.  Put them back.  */
          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
            result = -1;

          /* Now write out the byte sequence to get into the
             initial state if this is necessary.  */
          iconv (cd, NULL, NULL, &wrptr, &avail);

          break;
        }
      insize += nread;

      /* Do the conversion.  */
      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
      if (nconv == (size_t) -1)
        {
          /* Not everything went right.  It might only be
             an unfinished byte sequence at the end of the
             buffer.  Or it is a real problem.  */
          if (errno == EINVAL)
            /* This is harmless.  Simply move the unused
               bytes to the beginning of the buffer so that
               they can be used in the next round.  */
            memmove (inbuf, inptr, insize);
          else
            {
              /* It is a real problem.  Maybe we ran out of
                 space in the output buffer or we have invalid
                 input.  In any case back the file pointer to
                 the position of the last processed byte.  */
              lseek (fd, -(off_t) insize, SEEK_CUR);
              result = -1;
              break;
            }
        }
    }

  /* Terminate the output string.  */
  if (avail >= sizeof (wchar_t))
    *((wchar_t *) wrptr) = L'\0';

  if (iconv_close (cd) != 0)
    perror ("iconv_close");

  return (wchar_t *) wrptr - outbuf;
}

This example shows the most important aspects of using the iconv functions. It shows how successive calls to iconv can be used to convert large amounts of text. The user does not have to care about stateful encodings as the functions take care of everything.

An interesting point is the case where iconv returns an error and errno is set to EINVAL. This is not really an error in the transformation. It can happen whenever the input character set contains byte sequences of more than one byte for some character and texts are not processed in one piece. In this case there is a chance that a multibyte sequence is cut. The caller can then simply read the remainder of the text and feed the offending bytes together with new characters from the input to iconv and continue the work. Unlike with the conversion functions from the ISO C standard, the internal state kept in the descriptor is not left unspecified after such an event.

The example also shows the problem of using wide character strings with iconv. As explained in the description of the iconv function above the function always takes a pointer to a char array and the available space is measured in bytes. In the example the output buffer is a wide character buffer. Therefore we use a local variable wrptr of type char * which is used in the iconv calls.

This looks rather innocent but can lead to problems on platforms which have tight restrictions on alignment. Therefore the caller of iconv has to make sure that the pointers passed are suitable for access of characters from the appropriate character set. Since in the above case the input parameter to the function is a wchar_t pointer this is the case (unless the user violates alignment when computing the parameter). But in other situations, especially when writing generic functions where one does not know what type of character set one uses and therefore treats text as a sequence of bytes, it might become tricky.

Some Details about other iconv Implementations
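One way to stay on the safe side is to let the caller provide a real wchar_t array and derive the char pointer from it, as in the following sketch. The helper name utf8_to_wcs is hypothetical, and the character set names "WCHAR_T" and "UTF-8" are assumed to be available.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert INLEN bytes of UTF-8 at IN into the
   wide character buffer OUT, which has room for OUTCOUNT elements
   including the terminating null.  Returns the number of wide
   characters stored (not counting the terminator) or (size_t) -1.  */
size_t
utf8_to_wcs (const char *in, size_t inlen, wchar_t *out, size_t outcount)
{
  char *inptr = (char *) in;
  /* OUT is a wchar_t array, so this pointer is suitably aligned.  */
  char *outptr = (char *) out;
  size_t inleft = inlen;
  size_t outleft;
  size_t rc;
  iconv_t cd;

  if (outcount == 0)
    return (size_t) -1;
  /* Sizes are given in bytes; keep one element for the terminator.  */
  outleft = (outcount - 1) * sizeof (wchar_t);

  cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  *(wchar_t *) outptr = L'\0';
  return (size_t) ((wchar_t *) outptr - out);
}
```

Because the buffer size is passed as an element count and converted to bytes inside the helper, the caller cannot accidentally hand over a byte count that is not a multiple of sizeof (wchar_t).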

This is not really the place to discuss the iconv implementation of other systems but it is necessary to know a bit about them to write portable programs. The above mentioned problems with the specification of the iconv functions can lead to portability issues.

The first thing to notice is that due to the large number of character sets in use it is certainly not practical to encode the conversions directly in the C library. Therefore the conversion information must come from files outside the C library. This is usually done in one or both of the following ways:

  • The C library contains a set of generic conversion functions which can read the needed conversion tables and other information from data files. These files get loaded when necessary.

    This solution is problematic as it requires a great deal of effort to apply to all character sets (potentially an infinite set). The differences in the structure of the different character sets are so large that many different variants of the table processing functions must be developed. On top of this the generic nature of these functions makes them slower than specifically implemented functions.

  • The C library only contains a framework which can dynamically load object files and execute the therein contained conversion functions.

    This solution provides much more flexibility. The C library itself contains only very little code and therefore reduces the general memory footprint. Also, with a documented interface between the C library and the loadable modules it is possible for third parties to extend the set of available conversion modules. A drawback of this solution is that dynamic loading must be available.

Some implementations in commercial Unices implement a mixture of these possibilities, the majority only the second solution. Using loadable modules moves the code out of the library itself and keeps the door open for extensions and improvements. But this design is also limiting on some platforms since not many platforms support dynamic loading in statically linked programs. On platforms without this capability it is therefore not possible to use this interface in statically linked programs. The GNU C library has no problems with dynamic loading in these situations on ELF platforms and therefore this point is moot. The danger is that one gets accustomed to this and forgets about the restrictions on other systems.

A second thing to know about other iconv implementations is that the number of available conversions is often very limited. Some implementations provide in the standard release (not special international or developer releases) at most 100 to 200 conversion possibilities. This does not mean 200 different character sets are supported. E.g., conversions from one character set to a set of, say, 10 others count as 10 conversions. Together with the other direction this already makes 20. One can imagine the thin coverage these platforms provide. Some Unix vendors even provide only a handful of conversions which renders them useless for almost all uses.

This directly leads to a third and probably the most problematic point. On all known Unix systems, the way the iconv conversion functions are implemented means that the availability of a conversion from character set A to B and of a conversion from B to C does not imply that the conversion from A to C is available.

This might not seem unreasonable or problematic at first, but it is quite a big problem, as one will notice shortly after hitting it. To show the problem, assume we write a program which has to convert from A to C. A call like

cd = iconv_open ("C", "A");

fails according to the assumption above. But what does the program do now? The conversion is really necessary and therefore simply giving up is not an option.

This is a nuisance. The iconv function should take care of this. But how should the program proceed from here on? If it tries to convert to character set B first, the two iconv_open calls

cd1 = iconv_open ("B", "A");

and

cd2 = iconv_open ("C", "B");

will succeed but how to find B?

Unfortunately, the answer is: there is no general solution. On some systems guessing might help. On those systems most character sets can convert to and from UTF-8 encoded ISO 10646 or Unicode text. Beside this only some very system-specific methods can help. Since the conversion functions come from loadable modules and these modules must be stored somewhere in the filesystem, one could try to find them and determine from the available file which conversions are available and whether there is an indirect route from A to C.

This shows one of the design errors of iconv mentioned above. It should at least be possible to determine the list of available conversions programmatically so that if iconv_open says there is no such conversion, one could make sure this also is true for indirect routes.
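Once a suitable intermediate character set B is known, the indirect route has to be coded by hand on such systems. The following sketch shows one possible shape of such a helper; the names convert_once and convert_via are hypothetical, and the caller is assumed to provide scratch space that is large enough for the intermediate text.

```c
#include <iconv.h>
#include <stddef.h>

/* Run a single iconv conversion over a complete buffer.  Returns the
   number of bytes written or (size_t) -1.  */
static size_t
convert_once (const char *tocode, const char *fromcode,
              const char *in, size_t inlen, char *out, size_t outsize)
{
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  size_t rc;
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  /* Write out any bytes needed to return to the initial state.  */
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);
  iconv_close (cd);

  return rc == (size_t) -1 ? (size_t) -1 : (size_t) (outptr - out);
}

/* Hypothetical helper: convert from FROMCODE to TOCODE by going
   through the intermediate character set VIA.  TMP is scratch space
   of TMPSIZE bytes for the intermediate text.  */
size_t
convert_via (const char *tocode, const char *via, const char *fromcode,
             const char *in, size_t inlen,
             char *tmp, size_t tmpsize, char *out, size_t outsize)
{
  size_t n = convert_once (via, fromcode, in, inlen, tmp, tmpsize);

  if (n == (size_t) -1)
    return (size_t) -1;
  return convert_once (tocode, via, tmp, n, out, outsize);
}
```

The helper does not solve the real problem, which is finding B in the first place; it only shows how much extra work the caller has to do once B has been guessed.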

The iconv Implementation in the GNU C library

After reading about the problems of iconv implementations in the last section it is certainly good to note that the implementation in the GNU C library has none of the problems mentioned above. What follows is a step-by-step analysis of the points raised above. The evaluation is based on the current state of the development (as of January 1999). The development of the iconv functions is not complete, but basic functionality has solidified.

The GNU C library's iconv implementation uses shared loadable modules to implement the conversions. A very small number of conversions are built into the library itself but these are only rather trivial conversions.

All the benefits of loadable modules are available in the GNU C library implementation. This is especially appealing since the interface is well documented (see below) and it therefore is easy to write new conversion modules. The drawback of using loadable objects is not a problem in the GNU C library, at least on ELF systems. Since the library is able to load shared objects even in statically linked binaries, static linking need not be forbidden in case one wants to use iconv.

The second mentioned problem is the number of supported conversions. Currently, the GNU C library supports more than 150 character sets. The way the implementation is designed the number of supported conversions is greater than 22350 (150 times 149). If any conversion from or to a character set is missing it can easily be added.

Particularly impressive as it may be, this high number is due to the fact that the GNU C library implementation of iconv does not have the third problem mentioned above. I.e., whenever there is a conversion from a character set A to B and from B to C it is always possible to convert from A to C directly. If the iconv_open returns an error and sets errno to EINVAL this really means there is no known way, directly or indirectly, to perform the wanted conversion.

This is achieved by providing for each character set a conversion from and to UCS-4 encoded ISO 10646. Using ISO 10646 as an intermediate representation it is possible to triangulate, i.e., to convert via an intermediate representation.

There is no inherent requirement to provide a conversion to ISO 10646 for a new character set and it is also possible to provide other conversions where neither source nor destination character set is ISO 10646. The currently existing set of conversions is simply meant to cover all conversions which might be of interest.

All currently available conversions use the triangulation method above, making conversion run unnecessarily slow. If, e.g., somebody often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution would involve direct conversion between the two character sets, skipping the input to ISO 10646 first. The two character sets of interest are much more similar to each other than to ISO 10646.

In such a situation one can easily write a new conversion and provide it as a better alternative. The GNU C library iconv implementation would automatically use the module implementing the conversion if it is specified to be more efficient.

Format of gconv-modules files

All information about the available conversions comes from a file named gconv-modules which can be found in any of the directories along the GCONV_PATH. The gconv-modules files are line-oriented text files, where each of the lines has one of the following formats:

  • If the first non-whitespace character is a # the line contains only comments and is ignored.

  • Lines starting with alias define an alias name for a character set. There are two more words expected on the line. The first one defines the alias name and the second defines the original name of the character set. The effect is that it is possible to use the alias name in the fromcode or tocode parameters of iconv_open and achieve the same result as when using the real character set name.

    This is quite important as a character set has often many different names. There is normally always an official name but this need not correspond to the most popular name. Beside this many character sets have special names which are somehow constructed. E.g., all character sets specified by the ISO have an alias of the form ISO-IR-nnn where nnn is the registration number. This allows programs which know about the registration number to construct character set names and use them in iconv_open calls. More on the available names and aliases follows below.

  • Lines starting with module introduce an available conversion module. These lines must contain three or four more words.

    The first word specifies the source character set, the second word the destination character set of conversion implemented in this module. The third word is the name of the loadable module. The filename is constructed by appending the usual shared object suffix (normally .so) and this file is then supposed to be found in the same directory the gconv-modules file is in. The last word on the line, which is optional, is a numeric value representing the cost of the conversion. If this word is missing a cost of 1 is assumed. The numeric value itself does not matter that much; what counts are the relative values of the sums of costs for all possible conversion paths. Below is a more precise description of the use of the cost value.

Returning to the example above where one has written a module to directly convert from ISO-2022-JP to EUC-JP and back: all that has to be done is to put the new module, say with the name ISO2022JP-EUCJP.so, in a directory and add a file gconv-modules with the following content in the same directory:

module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1

To see why this is sufficient, it is necessary to understand how the conversion used by iconv (and described in the descriptor) is selected. The approach to this problem is quite simple.

At the first call of the iconv_open function the program reads all available gconv-modules files and builds up two tables: one containing all the known aliases and another which contains the information about the conversions and which shared object implements them.

Finding the conversion path in iconv

The set of available conversions forms a directed graph with weighted edges. The weights on the edges are the costs specified in the gconv-modules files. The iconv_open function uses an algorithm suitable for searching for the best path in such a graph and so constructs a list of conversions which must be performed in succession to get the transformation from the source to the destination character set.

Explaining why the above gconv-modules file allows the iconv implementation to select the specific ISO-2022-JP to EUC-JP conversion module instead of the conversion coming with the library itself is straightforward. Since the latter conversion takes two steps (from ISO-2022-JP to ISO 10646 and then from ISO 10646 to EUC-JP) the cost is 1+1 = 2. But the above gconv-modules file specifies that the new conversion module can perform this conversion at a cost of only 1.
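The effect of the cost values can be reproduced with a toy model of the graph. The sketch below is not the algorithm used by the GNU C library; it uses plain Bellman-Ford relaxation, and the type conv_edge and the function cheapest_cost are hypothetical names chosen for the illustration.

```c
#include <limits.h>
#include <stddef.h>

/* One hypothetical "module" line: a conversion edge with a cost.
   Character sets are identified by small integers.  */
struct conv_edge
{
  int from;
  int to;
  int cost;
};

/* Return the cheapest total cost to get from SRC to DST using the
   NEDGES edges over NNODES character sets (at most 16), or -1 if no
   path exists.  All edges must refer to nodes below NNODES.  */
int
cheapest_cost (const struct conv_edge *edges, size_t nedges,
               int nnodes, int src, int dst)
{
  int dist[16];
  int round;
  size_t i;

  if (nnodes <= 0 || nnodes > 16
      || src < 0 || src >= nnodes || dst < 0 || dst >= nnodes)
    return -1;

  for (i = 0; i < (size_t) nnodes; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge NNODES - 1 times; afterwards dist[] holds the
     cheapest sum of costs for every reachable character set.  */
  for (round = 1; round < nnodes; ++round)
    for (i = 0; i < nedges; ++i)
      if (dist[edges[i].from] != INT_MAX
          && dist[edges[i].from] + edges[i].cost < dist[edges[i].to])
        dist[edges[i].to] = dist[edges[i].from] + edges[i].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With only the four edges to and from the intermediate representation, the route from ISO-2022-JP to EUC-JP costs 2; adding a direct edge with cost 1 makes the direct module the cheapest path.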

A mysterious piece about the gconv-modules file above (and also the file coming with the GNU C library) are the names of the character sets specified in the module lines. Why do almost all the names end in //? And this is not all: the names can actually be regular expressions. At this point of time this mystery should not be revealed, unless you have the relevant spell-casting materials: ashes from an original DOS 6.2 boot disk burnt in effigy, a crucifix blessed by St. Emacs, assorted herbal roots from Central America, sand from Cebu, etc. Sorry! The part of the implementation where this is used is not yet finished. For now please simply follow the existing examples. It'll become clearer once it is. -drepper

A last remark about the gconv-modules file concerns the names not ending with //. There often is a character set named INTERNAL mentioned. From the discussion above and the chosen name it should have become clear that this is the name for the representation used in the intermediate step of the triangulation. We have said that this is UCS-4 but actually this is not quite right. The UCS-4 specification also includes the specification of the byte ordering used. Since a UCS-4 value consists of four bytes, a stored value is affected by byte ordering. The internal representation is not the same as UCS-4 in case the byte ordering of the processor (or at least the running process) is not the same as the one required for UCS-4. This is done for performance reasons as one does not want to perform unnecessary byte-swapping operations if one is not interested in actually seeing the result in UCS-4. To avoid trouble with endianness the internal representation consistently is named INTERNAL even on big-endian systems where the representations are identical.

iconv module data structures

So far this section described how modules are located and considered to be used. What remains to be described is the interface of the modules so that one can write new ones. This section describes the interface as it is in use in January 1999. The interface will change a bit in the future but hopefully only in an upward compatible way.
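The difference between the two representations can be made visible with a few lines of C. The function names host_is_big_endian and store_ucs4 are hypothetical; the sketch only illustrates the byte ordering issue and is not taken from the library.

```c
#include <stdint.h>
#include <string.h>

/* Return nonzero if the running process stores integers with the
   most significant byte first.  */
int
host_is_big_endian (void)
{
  const uint32_t probe = 1;
  unsigned char first;

  memcpy (&first, &probe, 1);
  return first == 0;
}

/* Store the code point C in the byte order UCS-4 requires (most
   significant byte first), whatever the host byte order is.  */
void
store_ucs4 (uint32_t c, unsigned char out[4])
{
  out[0] = (unsigned char) (c >> 24);
  out[1] = (unsigned char) (c >> 16);
  out[2] = (unsigned char) (c >> 8);
  out[3] = (unsigned char) c;
}
```

On a big-endian host the bytes produced by store_ucs4 are the same as the in-memory bytes of the 32-bit value; on a little-endian host they are reversed, which is exactly the swapping the internal representation avoids.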

The definitions necessary to write new modules are publicly available in the non-standard header gconv.h. The following text will therefore describe the definitions from this header file. But first it is necessary to get an overview.

From the perspective of the user of iconv the interface is quite simple: the iconv_open function returns a handle which can be used in calls to iconv and finally the handle is freed with a call to iconv_close. The problem is: the handle has to be able to represent the possibly long sequences of conversion steps and also the state of each conversion since the handle is all which is passed to the iconv function. Therefore the data structures are really the elements to understanding the implementation.

We need two different kinds of data structures. The first describes the conversion and the second describes the state etc. There are really two type definitions like this in gconv.h.

struct __gconv_step

This data structure describes one conversion a module can perform. For each function in a loaded module with conversion functions there is exactly one object of this type. This object is shared by all users of the conversion, i.e., this object does not contain any information corresponding to an actual conversion. It only describes the conversion itself.

struct __gconv_loaded_object *__shlib_handle, const char *__modname, int __counter

All these elements of the structure are used internally in the C library to coordinate loading and unloading the shared object. One must not expect any of the other elements to be available or initialized.

const char *__from_name, const char *__to_name

__from_name and __to_name contain the names of the source and destination character sets. They can be used to identify the actual conversion to be carried out since one module might implement conversions for more than one character set and/or direction.

gconv_fct __fct, gconv_init_fct __init_fct, gconv_end_fct __end_fct

These elements contain pointers to the functions in the loadable module. The interface will be explained below.

int __min_needed_from, int __max_needed_from, int __min_needed_to, int __max_needed_to

These values have to be filled in the init function of the module. The __min_needed_from value specifies how many bytes a character of the source character set at least needs. The __max_needed_from specifies the maximum value which also includes possible shift sequences.

The __min_needed_to and __max_needed_to values serve the same purpose but this time for the destination character set.

It is crucial that these values are accurate since otherwise the conversion functions will have problems or not work at all.

int __stateful

This element must also be initialized by the init function. It is nonzero if the source character set is stateful. Otherwise it is zero.

void *__data

This element can be used freely by the conversion functions in the module. It can be used to communicate extra information from one call to another. It need not be initialized if not needed at all. If this element gets assigned a pointer to dynamically allocated memory (presumably in the init function) it has to be made sure that the end function deallocates the memory. Otherwise the application will leak memory.

It is important to be aware that this data structure is shared by all users of this specific conversion and therefore the __data element must not contain data specific to one particular use of the conversion function.

struct __gconv_step_data

This is the data structure which contains the information specific to each use of the conversion functions.

char *__outbuf, char *__outbufend

These elements specify the output buffer for the conversion step. The __outbuf element points to the beginning of the buffer and __outbufend points to the byte following the last byte in the buffer. The conversion function must not assume anything about the size of the buffer, but it can safely be assumed that there is room for at least one complete character in the output buffer.

Once the conversion is finished and the conversion is the last step, the __outbuf element must be modified to point after the last byte written into the buffer to signal how much output is available. If this conversion step is not the last one, the element must not be modified. The __outbufend element must not be modified.

int __is_last

This element is nonzero if this conversion step is the last one. This information is necessary for the recursion. See the description of the conversion function internals below. This element must never be modified.
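The buffer rules described so far can be sketched as follows. This is an illustration under stated assumptions: struct step_data_sketch is a hypothetical reduced structure with unprefixed member names, and output_room and finish_step are helper names made up for this example.

```c
#include <stddef.h>

/* Hypothetical reduced version of struct __gconv_step_data with only
   the elements discussed so far.  */
struct step_data_sketch
{
  char *outbuf;     /* Start of the output buffer.  */
  char *outbufend;  /* One past the last byte of the buffer.  */
  int is_last;      /* Nonzero for the last step in the chain.  */
};

/* Number of bytes that may be written to the output buffer.  */
size_t
output_room (const struct step_data_sketch *data)
{
  return (size_t) (data->outbufend - data->outbuf);
}

/* Record that the conversion wrote up to WRITTEN_END.  Only the last
   step publishes this through the outbuf element; outbufend is never
   touched.  */
void
finish_step (struct step_data_sketch *data, char *written_end)
{
  if (data->is_last)
    data->outbuf = written_end;
}
```

A step in the middle of a chain leaves outbuf alone because the following step, not the caller, consumes its output.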

int __invocation_counter

The conversion function can use this element to see how many calls of the conversion function have already happened. Some character sets require a certain prolog when generating output. By comparing this value with zero one can find out whether this is the first call and therefore whether the prolog should be emitted. Apart from the increment at the end of each call, which is described with the conversion function below, this element must never be modified.

int __internal_use

This element is another one that is rarely used but needed in certain situations. It is assigned a nonzero value if the conversion functions are used to implement mbsrtowcs et al., i.e., if the function is not used directly through the iconv interface.

This sometimes makes a difference as it is expected that the iconv functions are used to translate entire texts while the mbsrtowcs functions are normally only used to convert single strings and might be used multiple times to convert entire texts.

But in this situation we would have a problem complying with some rules of the character set specification. Some character sets require a prolog which must appear exactly once for an entire text. If a number of mbsrtowcs calls are used to convert the text, only the first call must add the prolog. But since there is no communication between the different calls of mbsrtowcs, the conversion functions have no way to find this out. The situation is different for sequences of iconv calls since the handle allows access to the needed information.

This element is mostly used together with __invocation_counter in a way like this:

if (!data->__internal_use
    && data->__invocation_counter == 0)
  /* Emit prolog.  */
  ...

This element must never be modified.
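As a self-contained illustration of this test, the sketch below writes a prolog only on the first call made through the iconv interface. The structure struct prolog_data, the function emit_prolog_if_needed, and the two-byte prolog value are assumptions chosen for the example; a real module would use the prolog its character set requires.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced step data with only the two elements used
   for the prolog decision.  */
struct prolog_data
{
  int invocation_counter;  /* Completed calls of the conversion.  */
  int internal_use;        /* Nonzero when used via mbsrtowcs etc.  */
};

/* Example prolog of two bytes, chosen only for illustration.  */
static const char prolog[2] = { (char) 0xFE, (char) 0xFF };

/* Write the prolog to OUT if this is the first call and the
   conversion is used through the iconv interface.  Returns the
   number of bytes written.  OUT must have room for the prolog.  */
size_t
emit_prolog_if_needed (const struct prolog_data *data, char *out)
{
  if (!data->internal_use && data->invocation_counter == 0)
    {
      memcpy (out, prolog, sizeof prolog);
      return sizeof prolog;
    }
  return 0;
}
```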

mbstate_t *__statep

The __statep element points to an object of type mbstate_t (see the section called “Representing the state of the conversion”). The conversion of a stateful character set must use the object pointed to by this element to store information about the conversion state. The __statep element itself must never be modified.

mbstate_t __state

This element must never be used directly. It is only part of this structure to have the needed space allocated.

iconv module interfaces

With the knowledge about the data structures we can now describe the conversion functions themselves. To understand the interface, a bit of knowledge is necessary about the functionality in the C library which loads the objects with the conversions.

It is often the case that one conversion is used more than once, i.e., there are several iconv_open calls for the same set of character sets during one program run. The mbsrtowcs et al. functions in the GNU C library also use the iconv functionality, which increases the number of uses of the same functions even more.

For this reason the modules do not get loaded exclusively for one conversion. Instead a module once loaded can be used by arbitrarily many iconv or mbsrtowcs calls at the same time. The splitting of the information between conversion function specific information and conversion data makes this possible. The last section showed the two data structures used to do this.

This is of course also reflected in the interface and semantics of the functions the modules must provide. There are three functions which must have the following names:

gconv_init

The gconv_init function initializes the conversion function specific data structure. This very same object is shared by all users of this conversion and therefore no state information about the conversion itself must be stored in here. If a module implements more than one conversion, the gconv_init function will be called multiple times.

gconv_end

The gconv_end function is responsible for freeing all resources allocated by the gconv_init function. If there is nothing to do, this function can be missing. Special care must be taken if the module implements more than one conversion and the gconv_init function does not allocate the same resources for all conversions.

gconv

This is the actual conversion function. It is called to convert one block of text. It gets passed the conversion step information initialized by gconv_init and the conversion data, specific to this use of the conversion functions.

There are three data types defined for the three module interface functions and these define the interface.

int (*__gconv_init_fct) (struct __gconv_step *)

This specifies the interface of the initialization function of the module. It is called exactly once for each conversion the module implements.

As explained in the description of the struct __gconv_step data structure above, the initialization function has to initialize parts of it.

__min_needed_from, __max_needed_from, __min_needed_to, __max_needed_to

These elements must be initialized to the exact numbers of the minimum and maximum number of bytes used by one character in the source and destination character set respectively. If the characters all have the same size the minimum and maximum values are the same.

__stateful

This element must be initialized to a nonzero value if the source character set is stateful. Otherwise it must be zero.

If the initialization function needs to communicate some information to the conversion function, this can happen using the __data element of the __gconv_step structure. But since this data is shared by all the conversions, it must not be modified by the conversion function. How this can be used is shown in the example below.

#define MIN_NEEDED_FROM         1
#define MAX_NEEDED_FROM         4
#define MIN_NEEDED_TO           4
#define MAX_NEEDED_TO           4

int
gconv_init (struct __gconv_step *step)
{
  /* Determine which direction.  */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp2;
    }
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp2;
    }

  result = __GCONV_NOCONV;
  if (dir != illegal_dir)
    {
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = __GCONV_NOMEM;
      if (new_data != NULL)
        {
          new_data->dir = dir;
          new_data->var = var;
          step->__data = new_data;

          if (dir == from_iso2022jp)
            {
              step->__min_needed_from = MIN_NEEDED_FROM;
              step->__max_needed_from = MAX_NEEDED_FROM;
              step->__min_needed_to = MIN_NEEDED_TO;
              step->__max_needed_to = MAX_NEEDED_TO;
            }
          else
            {
              step->__min_needed_from = MIN_NEEDED_TO;
              step->__max_needed_from = MAX_NEEDED_TO;
              step->__min_needed_to = MIN_NEEDED_FROM;
              step->__max_needed_to = MAX_NEEDED_FROM + 2;
            }

          /* Yes, this is a stateful encoding.  */
          step->__stateful = 1;

          result = __GCONV_OK;
        }
    }

  return result;
}

The function first checks which conversion is wanted. The module from which this function is taken implements four different conversions, and which one is selected can be determined by comparing the names. The comparison should always be done without paying attention to the case.

Then a data structure is allocated which contains the necessary information about which conversion is selected. The data structure struct iso2022jp_data is locally defined since outside the module this data is not used at all. Please note that if all four conversions this module supports are requested there are four data blocks.

One interesting thing is the initialization of the __min_ and __max_ elements of the step object. A single ISO-2022-JP character can consist of one to four bytes. Therefore the MIN_NEEDED_FROM and MAX_NEEDED_FROM macros are defined this way. The output is always the INTERNAL character set (aka UCS-4) and therefore each character consists of exactly four bytes.

For the conversion from INTERNAL to ISO-2022-JP we have to take into account that escape sequences might be necessary to switch the character sets. Therefore the __max_needed_to element for this direction gets assigned MAX_NEEDED_FROM + 2. This takes into account the two bytes needed for the escape sequences to signal the switching.

The asymmetry in the maximum values for the two directions can be explained easily: when reading ISO-2022-JP text, escape sequences can be handled alone, i.e., it is not necessary to process a real character since the effect of the escape sequence can be recorded in the state information. The situation is different for the other direction. Since it is in general not known which character comes next, one cannot emit escape sequences to change the state in advance. This means the escape sequences have to be emitted together with the next character. Therefore one needs more room than only for the character itself.

The possible return values of the initialization function are:

__GCONV_OK

The initialization succeeded.

__GCONV_NOCONV

The requested conversion is not supported in the module. This can happen if the gconv-modules file has errors.

__GCONV_NOMEM

Memory required to store additional information could not be allocated.

The function called before the module is unloaded is significantly simpler. It often has nothing at all to do, in which case it can be left out completely.

void (*__gconv_end_fct) (struct __gconv_step *)

The task of this function is to free all resources allocated in the initialization function. Therefore only the __data element of the object pointed to by the argument is of interest. Continuing the example from the initialization function, the finalization function looks like this:

void
gconv_end (struct __gconv_step *data)
{
  free (data->__data);
}
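The pairing of the two functions can be tried outside the C library with a reduced model. The following is a minimal sketch under stated assumptions: struct step_sketch, sketch_init, sketch_end, the character set name "DEMO//", and the plain 0 and -1 return values are all hypothetical stand-ins, and the POSIX function strcasecmp replaces the internal __strcasecmp.

```c
#include <stdlib.h>
#include <strings.h>

enum direction { illegal_dir, from_demo, to_demo };

/* Hypothetical reduced step object: the two names plus the data
   pointer shared by all users of the conversion.  */
struct step_sketch
{
  const char *from_name;
  const char *to_name;
  void *data;
};

/* Select the direction by a case-insensitive name comparison and
   store it in freshly allocated memory.  Returns 0 on success and
   -1 if the conversion is unsupported or memory is exhausted.  */
int
sketch_init (struct step_sketch *step)
{
  enum direction dir = illegal_dir;
  enum direction *new_data;

  step->data = NULL;
  if (strcasecmp (step->from_name, "DEMO//") == 0)
    dir = from_demo;
  else if (strcasecmp (step->to_name, "DEMO//") == 0)
    dir = to_demo;

  if (dir == illegal_dir)
    return -1;

  new_data = malloc (sizeof *new_data);
  if (new_data == NULL)
    return -1;

  *new_data = dir;
  step->data = new_data;
  return 0;
}

/* Release what sketch_init allocated.  Calling free with a null
   pointer is harmless, so a failed init needs no special case.  */
void
sketch_end (struct step_sketch *step)
{
  free (step->data);
  step->data = NULL;
}
```

Each supported conversion gets its own allocation in this model, which matches the remark above that several requested conversions lead to several data blocks.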

The most important function is the conversion function itself. It can get quite complicated for complex character sets. But since this is not of interest here we will only describe a possible skeleton for the conversion function.

int (*__gconv_fct) (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)

The conversion function can be called for two basic reasons: to convert text or to reset the state. From the description of the iconv function it can be seen why the flushing mode is necessary. Which mode is selected is determined by the sixth argument, an integer. If it is nonzero, flushing is selected.

Common to both modes is where the output buffer can be found. The information about this buffer is stored in the conversion step data. A pointer to this is passed as the second argument to this function. The description of the struct __gconv_step_data structure has more information on this.

What has to be done for flushing depends on the source character set. If it is not stateful nothing has to be done. Otherwise the function has to emit a byte sequence to bring the state object into the initial state. Once this has happened, the other conversion modules in the chain of conversions have to get the same chance. Whether another step follows can be determined from the __is_last element of the step data structure to which the second parameter points.

The more interesting mode is when text actually has to be converted. The first step in this case is to convert as much text as possible from the input buffer and store the result in the output buffer. The start of the input buffer is determined by the third argument, which is a pointer to a pointer variable referencing the beginning of the buffer. The fourth argument is a pointer to the byte right after the last byte in the buffer.

The conversion has to be performed according to the current state if the character set is stateful. The state is stored in an object pointed to by the __statep element of the step data (second argument). Once either the input buffer is empty or the output buffer is full the conversion stops. At this point the pointer variable referenced by the third parameter must point to the byte following the last processed byte. I.e., if all of the input is consumed this pointer and the fourth parameter have the same value.
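The stopping rule and the pointer update can be seen in a toy conversion loop. This is only a sketch for a stateless, fixed-width case (ISO-8859-1 bytes to four-byte UCS-4 values in host byte order); the function latin1_to_ucs4 and the SK_ status names are hypothetical and the output buffer is passed directly instead of through the step data.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the status codes of the module
   interface.  */
enum { SK_EMPTY_INPUT, SK_FULL_OUTPUT };

/* Convert as much input as possible.  The loop stops when the input
   is used up or when the output has no room for another four-byte
   character.  On return *INBUF points after the last processed
   input byte and *OUTBUF after the last written output byte.  */
int
latin1_to_ucs4 (const char **inbuf, const char *inbufend,
                char **outbuf, char *outbufend)
{
  const char *in = *inbuf;
  char *out = *outbuf;
  int status = SK_EMPTY_INPUT;

  while (in < inbufend)
    {
      uint32_t ch;

      if (outbufend - out < 4)
        {
          status = SK_FULL_OUTPUT;
          break;
        }

      /* Every ISO-8859-1 byte maps to the code point of the same
         value.  */
      ch = (unsigned char) *in++;
      memcpy (out, &ch, sizeof ch);
      out += sizeof ch;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

When all input is consumed, the updated *inbuf equals inbufend, which is the condition stated above for the third and fourth parameters.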

What happens now depends on whether this step is the last one or not. If it is the last step, the only thing which has to be done is to update the __outbuf element of the step data structure to point after the last written byte. This gives the caller the information on how much text is available in the output buffer. Besides this, the variable pointed to by the fifth parameter, which is of type size_t, must be incremented by the number of characters (not bytes) which were converted in a non-reversible way. Then the function can return.

In case the step is not the last one the later conversion functions have to get a chance to do their work. Therefore the appropriate conversion function has to be called. The information about the functions is stored in the conversion data structures, passed as the first parameter. This information and the step data are stored in arrays so the next element in both cases can be found by simple pointer arithmetic:

int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  ...

The next_step pointer references the next step information and next_data the next data record. The call of the next function therefore will look similar to this:

  next_step->__fct (next_step, next_data, &outerr, outbuf,
                    written, 0)

But this is not yet all. Once the function call returns, the conversion function might have some more to do. If the return value of the function is __GCONV_EMPTY_INPUT, this means there is more room in the output buffer. Unless the input buffer is empty, the conversion function starts all over again and processes the rest of the input buffer. If the return value is not __GCONV_EMPTY_INPUT, something went wrong and we have to recover from this.
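The array layout and the call into the next step can be tried in isolation with a toy chain. The sketch below uses hypothetical types (struct chain_step, struct chain_data) and a made-up conversion that adds one to every byte; it deliberately leaves out the state, the flush mode, and any handling of the return value of the next step.

```c
#include <stddef.h>

struct chain_step;
struct chain_data;

typedef int (*chain_fct) (struct chain_step *, struct chain_data *,
                          const char **, const char *);

/* Hypothetical reduced step and step data objects.  Both are kept
   in arrays with one element per step of the chain.  */
struct chain_step { chain_fct fct; };
struct chain_data { char *outbuf; char *outbufend; int is_last; };

/* Toy conversion: copy each input byte plus one to the output,
   then hand the output to the next step unless this is the last.  */
int
add_one_step (struct chain_step *step, struct chain_data *data,
              const char **inbuf, const char *inbufend)
{
  char *out = data->outbuf;

  while (*inbuf < inbufend && out < data->outbufend)
    *out++ = (char) (*(*inbuf)++ + 1);

  if (data->is_last)
    {
      /* Tell the caller how much output is available.  */
      data->outbuf = out;
      return 0;
    }
  else
    {
      /* The next elements are found by pointer arithmetic.  */
      struct chain_step *next_step = step + 1;
      struct chain_data *next_data = data + 1;
      const char *outerr = data->outbuf;

      /* This step's output is the next step's input.  */
      return next_step->fct (next_step, next_data, &outerr, out);
    }
}
```

With two such steps in a chain, the input "ac" ends up as "ce" in the buffer of the last step, while the __outbuf counterpart of the first step stays untouched.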

A requirement for the conversion function is that the input buffer pointer (the third argument) always points to the last character which was put in the converted form in the output buffer. This is trivially true after the conversion performed in the current step. But if the conversion functions deeper down the stream stop prematurely, not all characters from the output buffer are consumed and therefore the input buffer pointers must be backed off to the right position.

This is easy to do if the input and output character sets have a fixed width for all characters. In this situation we can compute how many characters are left in the output buffer and therefore can correct the input buffer pointer appropriately with a similar computation. Things get tricky if either character set has characters represented with variable length byte sequences, and it gets even more complicated if the conversion has to take care of the state. In these cases the conversion has to be performed once again, from the known state before the initial conversion, i.e., if necessary the state of the conversion has to be reset and the conversion loop has to be executed again. The difference now is that it is known how much output must be created, and the conversion can stop before converting the first unused character. Once this is done the input buffer pointers must be updated again and the function can return.
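For the fixed-width case the correction is a short computation. The following sketch assumes both character sets use a constant number of bytes per character; the function name back_off_input and its parameters are hypothetical and only serve to show the arithmetic.

```c
#include <stddef.h>

/* Back *INBUF off after the next step consumed this step's output
   only up to OUTERR instead of up to OUTBUF.  IN_WIDTH and
   OUT_WIDTH are the fixed number of bytes per character in the
   input and output character set.  */
void
back_off_input (const char **inbuf, const char *outerr,
                const char *outbuf, size_t in_width, size_t out_width)
{
  /* Characters which were converted but not consumed.  */
  size_t unused_chars = (size_t) (outbuf - outerr) / out_width;

  /* Step back over the same number of input characters.  */
  *inbuf -= unused_chars * in_width;
}
```

For example, if five one-byte characters were converted to 20 bytes of four-byte output and the next step stopped after 12 bytes, two characters are unused and the input pointer moves back by two bytes.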

One final thing should be mentioned. If it is necessary for the conversion to know whether it is the first invocation (in case a prolog has to be emitted), the conversion function should, just before returning to the caller, increment the __invocation_counter element of the step data structure. See the description of the struct __gconv_step_data structure above for more information on how this can be used.

The return value must be one of the following values:

__GCONV_EMPTY_INPUT

All input was consumed and there is room left in the output buffer.

__GCONV_FULL_OUTPUT

No more room in the output buffer. In case this is not the last step this value is propagated down from the call of the next conversion function in the chain.

__GCONV_INCOMPLETE_INPUT

The input buffer is not entirely empty since it contains an incomplete character sequence.

The following example provides a framework for a conversion function. In case a new conversion has to be written the holes in this implementation have to be filled and that is it.

int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  gconv_fct fct = next_step->__fct;
  int status;

  /* If the function is called with no input this means we have
     to reset to the initial state.  The possibly partly
     converted input is dropped.  */
  if (do_flush)
    {
      status = __GCONV_OK;

      /* Possibly emit a byte sequence which puts the state object
         into the initial state.  */

      /* Call the steps down the chain if there are any but only
         if we successfully emitted the escape sequence.  */
      if (status == __GCONV_OK && ! data->__is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    }
  else
    {
      /* We preserve the initial values of the pointer variables.  */
      const char *inptr = *inbuf;
      char *outbuf = data->__outbuf;
      char *outend = data->__outbufend;
      char *outptr;

      do
        {
          /* Remember the start value for this round.  */
          inptr = *inbuf;
          /* The outbuf buffer is empty.  */
          outptr = outbuf;

          /* For stateful encodings the state must be saved here.  */

          /* Run the conversion loop.  status is set
             appropriately afterwards.  */

          /* If this is the last step leave the loop, there is
             nothing we can do.  */
          if (data->__is_last)
            {
              /* Store information about how many bytes are
                 available.  */
              data->__outbuf = outbuf;

              /* If any non-reversible conversions were performed,
                 add the number to *written.  */

              break;
            }

          /* Write out all output which was produced.  */
          if (outbuf > outptr)
            {
              const char *outerr = data->__outbuf;
              int result;

              result = fct (next_step, next_data, &outerr,
                            outbuf, written, 0);

              if (result != __GCONV_EMPTY_INPUT)
                {
                  if (outerr != outbuf)
                    {
                      /* Reset the input buffer pointer.  We
                         document here the complex case.  */
                      size_t nstatus;

                      /* Reload the pointers.  */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* Possibly reset the state.  */

                      /* Redo the conversion, but this time
                         the end of the output buffer is at
                         outerr.  */
                    }

                  /* Change the status.  */
                  status = result;
                }
              else
                /* All the output is consumed, we can make
                    another run if everything was ok.  */
                if (status == __GCONV_FULL_OUTPUT)
                  status = __GCONV_OK;
           }
        }
      while (status == __GCONV_OK);

      /* We finished one use of this step.  */
      ++data->__invocation_counter;
    }

  return status;
}

This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources. It contains many examples of working and optimized modules.