PyPy
PyPy[rctypes]

RCtypes

1   Using RCtypes

ctypes is a library for CPython that has been around for some years, and that is now integrated in Python 2.5. It allows Python programs to directly perform calls to external libraries written in C, by declaring function prototypes and all necessary data structures dynamically in Python. It is based on the platform-dependent libffi library.

RCtypes is ctypes restricted to be used from RPython programs. Basically, an RPython program that is going to be translated to C or a C-like language can use (a subset of) ctypes more or less normally.

During the translation of the RPython program to C, the source code -- which contains dynamic foreign function calls written using ctypes -- becomes static C code that performs the same calls. In other words, once compiled within an RPython program, ctypes constructs are all replaced by regular, static C code that no longer uses the libffi library. This makes ctypes a very good way to interface with any C library -- not only from the PyPy interpreter, but from any program or extension module that is written in RPython.

This document assumes some basic familiarity with ctypes. It is recommended to first look up the ctypes documentation. As an appendix to this document we also point out the Finest issues in the behavior of ctypes, which are not explicitly documented anywhere else to the best of our knowledge.

1.1   Restrictions

The main restriction that RCtypes adds over ctypes is similar to the general restriction that RPython adds over Python: you cannot be "too dynamic" at run-time, but you can be as dynamic as you want during bootstrapping, i.e. when the source code is imported.

For RCtypes, this means that all ctypes type and function declarations must be done while bootstrapping. You cannot declare new types and functions from inside RPython functions that are going to be translated to C. In practice, this is a fairly soft limitation: most modules using ctypes would typically declare all functions globally at module-level anyway.

Apart from this, you can use many ctypes constructs freely: all primitive types except c_float and c_wchar and c_wchar_p (they will be added at a later time), pointers, structures, arrays, external functions. The following ctypes functions are supported at run-time: create_string_buffer(), pointer(), POINTER(), sizeof(), cast(). (Adding support for more is easy.) There is some support for Unions, and for callbacks and a few more obscure ctypes features, but e.g. custom allocators and return-value-error-checkers are not supported.

Remember to use None as the return type in the function declaration if it returns void.

There is special support for the special variable errno of C: from the RPython program, access it with the helper function pypy.rpython.rctypes.aerrno.geterrno().

Note that the support for the POINTER() function is an exception to the rule that types should not be manipulated at run-time. You can only call it with a constant type, as in POINTER(c_int).

Another exception is that an expression like c_int * n can appear at run-time. It can be used to create variable-sized array instances, i.e. arrays whose length is not a static constant, as in my_array = (c_int * n)(). Similarly, the create_string_buffer() function returns a variable-sized array of chars.

NOTE: in order to translate an RPython program using ctypes, the module pypy.rpython.rctypes.implementation must be imported! This is required to register to the annotator/rtyper how the ctypes constructs should be translated. Just import this module from your RPython program, in addition to importing ctypes.

1.2   ctypes version and platform notes

ctypes-0.9.9.6 or later is required.

On Mac OSX 10.3 (at least with ctypes-0.9.9.6) you need to change the RTLD_LOCAL default in ctypes/__init__.py line 293 to:

def __init__(self, name, mode=RTLD_GLOBAL, handle=None):

otherwise it will complain with "unable to open this file with RTLD_LOCAL" when trying to load the C library.

2   Implementation design

We have experimented with two different implementation approaches. The first one is relatively mature, but does not do the right thing about memory management issues in the most advanced cases. Moreover, it prevents the RPython program to be compiled with a Moving Garbage Collector, which has been the major factor stopping us from experimenting with advanced GCs so far. It is described in the chapter RCtypes implemented in the RTyper.

The alternative implementation of RCtypes is work in progress; its approach is described in the chapter RCtypes implemented via a pure RPython interface.

2.1   RCtypes implemented in the RTyper

The currently available implementation of RCtypes works by integrating itself within the annotation and most importantly RTyping process of the translation toolchain: the ctypes objects that the RPython program uses, and the operations it performs on them, are individually replaced during RTyping by sequences of low-level operations performing the equivalent operation in a C-like language.

2.1.1   Annotation

Ctypes on CPython tracks all the memory it has allocated by itself, may it be referenced by pointers and structures or only pointers. Thus memory allocated by ctypes is properly garbage collected and no dangling pointers should arise.

Pointers to structures returned by an external function are a different story. For such pointers we have to assume that they were not allocated by ctypes, and just keep a pointer to the memory, hoping that it will not go away. This is the same as in ctypes.

For objects that are known to be allocated by RCtypes, a simple GC-aware box suffices to hold the needed memory. For other objects, an extra indirection is required. Thus we use the annotator to track which objects are statically known to be allocated by RCtypes.

2.1.2   RTyping: Memory layout

In ctypes, all instances are mutable boxes containing either some raw memory with a layout compatible to that of the equivalent C type, or a reference to such memory. The reference indirection is transparent to the user; for example, dereferencing a ctypes object "pointer to structure" results in a "structure" object that doesn't include a copy of the data, but only a reference to that data. (This is similar to the C++ notion of reference: it is just a pointer at the machine level, but at the language level it behaves like the object that it points to, not like a pointer.)

The rtyper maps this to the LLType model as follows. For boxes that embed the raw memory content (i.e. are known to own their memory):

Ptr( GcStruct( "name_owner",
       ("c_data", Struct(...) ) ) )

where the raw memory content and layout is specified by the "Struct(...)" part.

The key point here is that the "Struct(...)" part follows exactly the C layout, with no extra headers. If it contains pointers, they point to exact C data as well, without headers, and so on. From C's point of view this is all perfectly legal C data in the format expected by the external C library -- because the C library never sees pointers to the whole GcStruct boxes.

In other words, we have a two-level picture of how data looks like in a translated RPython program: at the RPython level, it contains a graph of GcStructs, with GC headers, and sometimes fields pointing at each other for the purpose of keeping the necessary GcStructs alive. At the same time, inside these GcStruct, there is a part that looks exactly like C data structures. This part can contain pointers to other C data structures, which may live within other GcStructs. So the C libraries will only see and modify the C data that is embedded in the "c_data" substructure of GcStructs. This is somewhat similar to the idea that if you call malloc() in a C program, you actually get a block of memory that is embedded in some larger block with malloc-specific headers; but at the level of C it doesn't matter because you only pass around and store pointers to the "embedded sub-blocks" of memory, never to the whole block with all its headers.

2.1.2.1   Owner/alias boxes

Here is again the case of boxes that embed the raw memory content (owner boxes):

Ptr( GcStruct( "name_owner",
       ("c_data", Struct(...) ) ) )

For boxes that don't embed the raw memory content (alias boxes):

Ptr( GcStruct( "name_alias",
       ("c_data_owner_keepalive", Ptr( GcStruct( "name_owner" ) ),
       ("c_data", Ptr( Struct(...) ) ) ) )

In both cases, the outer GcStruct is needed to make the boxes tracked by the GC automatically. The "c_data" field either embeds or references the raw memory (we reuse the same field name to simplify writing ll helpers). The nested "Struct(...)" definition specifies the exact C layout expected for that memory.

In the alias case, the "c_data_owner_keepalive" pointer is used for cases where the memory was actually allocated by Ctypes, even though the annotator didn't figure it out statically. This pointer field is set in this case to the owner GcStruct, to keep it alive. Note that the "c_data" pointer will then point inside the same owner GcStruct -- it will point to the structure "c_data_owner_keepalive.c_data".

Of course, the "c_data" field is not visible to the RPython-level user.

2.1.2.2   Primitive Types

Ctypes' primitive types are mapped directly to the correspondending PyPy LLType: Signed, Float, etc. For the owned-memory case, we get:

Ptr( GcStruct( "CtypesBox_<TypeName>"
        ( "c_data"
                (Struct "C_Data_<TypeName>
                        ( "value", Signed/Float/etc. ) ) ) ) )

Note that we don't make "c_data" itself a Signed or Float directly because in LLType we can't take pointers to Signed or Float, only to Struct or Array.

The non-owned-memory case is:

Ptr( GcStruct( "CtypesBox_<TypeName>"
        ( "c_data_owner_keepalive" ... ),
        ( "c_data"
                (Ptr(Struct "C_Data_<TypeName>
                        ( "value", Signed/Float/etc. ) ) ) ) ) )

2.1.2.3   Pointers

Ptr( GcStruct( "CtypesBox_<TypeName>"
        ( "c_data"
                (Struct "C_Data_<TypeName>
                        ( "value", Ptr(...) ) ) ) ) )

or:

Ptr( GcStruct( "CtypesBox_<TypeName>"
        ( "c_data_owner_keepalive" ... ),
        ( "c_data"
                (Ptr(Struct "C_Data_<TypeName>
                        ( "value", Ptr(...) ) ) ) ) ) )

However, there is a special case here: the pointer might point to data owned by another CtypesBox -- i.e. it can point to the "c_data" field of some other CtypesBox. In this case we must make sure that the other CtypesBox stays alive. This is done by adding an extra field referencing the gc box (this field is not otherwise used):

Ptr( GcStruct( "CtypesBox_<TypeName>"
        ( "c_data"
                (Struct "C_Data_<TypeName>
                        ( "value", Ptr(...) ) ) )
        ( "keepalive"
                (Ptr(GcStruct("CtypesBox_<TargetTypeName>"))) ) ) )

Note that the above example shows the memory-owning case. As usual, its memory-aliasing equivalent also contains the "c_data_owner_keepalive" field:

Ptr( GcStruct( "CtypesBox_<TypeName>"
        ( "c_data_owner_keepalive" ... ),
        ( "c_data"
                (Ptr(Struct "C_Data_<TypeName>
                        ( "value", Ptr(...) ) ) ) )
        ( "keepalive"
                (Ptr(GcStruct("CtypesBox_<TargetTypeName>"))) ) ) )

The two keepalive fields are easy to confuse with each other, but they have different types and goals. The "c_data_owner_keepalive" is used if the place where the pointer itself is stored needs to be kept alive. The "keepalive" field is used if whatever it is that the pointer points to needs to be kept alive.

2.1.2.4   Structures

Structures have the following memory layout (owning their raw memory) if they were allocated by ctypes:

Ptr( GcStruct( "CtypesBox_<StructName>"
        ( "c_data"
                (Struct "C_Data_<StructName>
                        *<Fieldefintions>) ) ) )

For structures obtained by dereferencing a pointer (by reading its "contents" attribute), the structure box does not own the memory:

Ptr( GcStruct( "CtypesBox_<StructName>"
        ( "c_data_owner_keepalive" ... ),
        ( "c_data"
                (Ptr(Struct "C_Data_<StructName>
                        *<Fieldefintions>) ) ) ) )

One or several Keepalive fields might be necessary in each case: more precisely, for every pointer field that the structure contains, we need an extra keepalive field. It is set when a value is stored in the pointer field of the structure, and when this value is a pointer to RCtypes-owned memory. In this case, the corresponding keepalive field of the GcStruct is made to point to the target GcStruct.

2.1.2.5   Arrays

Arrays behave like structures, but use an Array instead of a Struct in the "c_data" declaration.

For similar reasons, arrays of pointers have an additional array of keepalives.

2.2   RCtypes implemented via a pure RPython interface

As of January 2007, a second implementation of RCtypes is being developed. The basic conclusion on the work on the first implementation is that it is quite tedious to have to write code that generates detailed low-level operations for each high-level ctypes operation - and most importantly, it is not flexible enough for our purposes. The major issue is that problems were recently found in the corner cases of memory management, i.e. when ctypes objects should keep other ctypes object alive and for how long. Fixing this would involve some pervasive changes in the low-level representation of ctypes objects, which would require all the RCtypes RTyping code to be updated. Similarly, to be able to use a Moving Garbage Collector, the memory that is visible to the external C code need to be separated from the GC-managed memory that is used for the RCtypes objects themselves; this would also require a complete upgrade of all the RTyping code.

2.2.1   The rctypesobject RPython library

To solve this, we started by writing a complete RPython library that offers the same functionality as ctypes, but with a more regular, RPython-compatible interface. This library is in the pypy/rlib/rctypes/rctypesobject.py module. All operations use either static or instance methods on classes following a structure similar to the ctypes classes. The flexibility of writing a normal RPython library allowed us to use internal support classes freely, together with data structures appropriate to the memory management needs.

This interface is more tedious to use than ctypes' native interface, but it not meant for manual use. Instead, the translation toolchain maps uses of the usual ctypes interface in the RPython program to the more verbose but more regular rctypesobject interface.

The rctypesobject library is mostly implemented as a family of functions that build and return new classes. For example, Primitive(Signed) return the class corresponding to C objects of the low-level type Signed (corresponding to long in C), RPointer(Primitive(Signed)) is the class corresponding to the type "pointer to long" and RVarArray(RPointer(Primitive(Signed))) return the class corresponding to arrays of pointers to longs.

Each of these classes expose a similar interface: the allocate() static method returns a new instance with its own memory storage (allocated with a separate lltype.malloc() as regular C memory, and automatically freed from the __del__() method of the instance). For example, Primitive(Signed).allocate() returns an instance that manages a word of C memory, large enough to contain just one long. If x designates this instance, the expression pointer(x) allocates and returns an instance of class RPointer(Primitive(Signed)) that manages a word of C memory, large enough to hold a pointer, and initialized to point to the previous long in memory.

The details of the interface can be found in the library's accompanying test file, pypy/rlib/rctypes/test/test_rctypesobject.py.

2.2.2   ControllerEntry

We implemented a generic way in the translation toolchain to map operations on arbitrary Python objects to calls to "glue" RPython classes and methods. This mechanism can be found in pypy/rpython/controllerentry.py. In short, it allows code to register, for each full Python object, class, or metaclass, a corresponding "controller" class. During translation, the controller is invoked when the Python object, class or metaclass is encountered. The annotator delegates to the controller the meaning of all operations - instantiation, attribute reading and setting, etc. By default, this delegation is performed by replacing the operation by a call to a method of the controller; in this way, the controller can be written simply as a family of RPython methods with names like:

  • new() - called when the source RPython program tries to instantiate the full Python class
  • get_xyz() and set_xyz() - called when the RPython program tries to get or set the attribute xyz
  • getitem() and setitem() - called when the RPython program uses the [ ] notation to index the full Python object

If necessary, the controller can also choose to entirely override the default annotating and rtyping behavior and insert its own. This is useful for cases where the method cannot be implemented in RPython, e.g. in the presence of a polymorphic operation that would cause the method signature to be ill-typed.

For RCtypes, we implemented controllers that map the regular ctypes objects, classes and metaclasses to classes and operations from rctypesobject. This turned out to be a good way to separate the RCtypes implementation issues from the (sometimes complicated) interpretation of ctypes' rich and irregular interface.

2.2.3   Performance

The greatest advantage of the ControllerEntry approach over the direct RTyping approach of the first RCtypes implementation is its higher level, giving flexibility. This is also potentially a disadvantage: there is for example no owner/alias analysis done during annotation; instead, all ctypes objects of a given type are implemented identically. We think that the general optimizations that we implemented - most importantly malloc removal - are either good enough to remove the overhead in the common case, or can be made good enough with some more efforts.

XXX work in progress.

3   Appendix

3.1   Finest issues in the behavior of ctypes

For reference, this section describes ctypes behaviour on CPython, as far as it is know to the author and relevant for rctypes. Of course, you should consult the ctypes documentation for more information. The test file pypy/rpython/rctypes/test/test_ctypes.py also exercices many corner cases of ctypes that were relevant for the rctypes implementation design.

3.1.1   Primitive Types

All primitive types behave like their CPython counterparts. Some types like c_long and c_int are identical on some hardware platforms, depending on size of C's int type on those platforms.

3.1.2   Heap Allocated Types

Ctypes deals with heap allocated instances of types in a simple and straightforward manner. However documentationwise it has some shady corners when it comes to heap allocated types.

3.1.2.1   Structures

Structures allocated by ctypes

Structure instances in ctypes are actually proxies for C-compatible memory. The behaviour of such instances is illustrated by the following example structure throughout this document:

class TS( Structure ):
    _fields_ = [ ( "f0", c_int ), ( "f1", c_float ) ]

ts0 = TS( 42, -42 )
ts1 = TS( -17, 17 )

p0 = pointer( ts0 )
p1 = pointer( ts1 )

You can not assign structures by value as in C or C++. But ctypes provides a memmove-function just like C does. You do not need to pass a ctype's pointer to the structure type to memmove. Instead you can pass the structures directly as in the example below:

memmove( ts0, ts1, sizeof( ts0 ) )
assert ts0.f0 == -17
assert ts0.f1 == 17

3.1.2.2   Structures created from foreign pointers

The author is currently not sure, whether the text below is correct in all aspects, but carefully planned trials did not provide any evidence to the contrary.

Structure instances can also be created by dereferencing pointers that where returned by a call to an external function.

More on pointers in the next section.

3.1.3   Pointers

Pointer types are created by a factory function named POINTER. Pointers can be created by either calling and thus instanciating a pointer type or by calling another function named pointer that creates the necessary pointer type on the fly and returns a pointer to the instance.

Pointers only implement one attribute named contents which references the ctypes instance the pointer points to. Assigning to this attribute only changes the pointers notion of the object it points to. The instances themselves are not touched, especially structures are not copied by value.

3.1.3.1   Pointers to Structures Allocated by The Ctypes

Pointers to structures that where allocated by ctypes contain a reference to the structure in the contents attribute mentioned above. This reference is know to the garbage collector, which means that even if the last structure instance is deallocated, the C-compatible memory is not, provided a pointer still points to it.

3.1.3.2   Pointers to Structures Allocated by Foreign Modules

In this case the structure was probably allocated by the foreign module - at least ctypes must assume it was. In this case the pointer's reference to the structure should not be made known to the GC. In the same sense the structure itself must record the fact that its C-compatible memory was not allocated by ctypes itself.