PyPy can expose to its users language features similar to the ones present in Stackless Python: no recursion depth limit, and the ability to write code in a massively concurrent style. It actually exposes three different paradigms to choose from:

* Tasklets and channels
* Coroutines
* Greenlets
All of them are extremely light-weight, which means that PyPy should be able to handle programs containing large amounts of coroutines, tasklets and greenlets.
If you are running py.py on top of CPython, then you need to enable the _stackless module by running it as follows:
    py.py --withmod-_stackless
This is implemented internally using greenlets, so it only works on a platform where Py lib greenlets are supported. A few features do not work this way, though, and really require a translated pypy-c.
To obtain a translated version of pypy-c that includes Stackless support, run translate.py as follows:
    cd pypy/translator/goal
    python translate.py --stackless
A stackless PyPy contains a module called stackless. The interface exposed by this module has not been refined much, so it should be considered in flux (as of 2007).
So far, PyPy does not provide support for stackless in a threaded environment. This limitation is not fundamental, as previous experience has shown, so supporting this would probably be reasonably easy.
An interesting point is that the same stackless module can provide a number of different concurrency paradigms at the same time. From a theoretical point of view, none of the three above-mentioned paradigms considered on its own is new: two of them come from previous Python work, and the third one is a variant of the classical coroutine. The new part is that the PyPy implementation manages to provide all of them and lets the user implement more. Moreover - and this might be an important theoretical contribution of this work - we manage to provide these concurrency concepts in a "composable" way. In other words, it is possible to naturally mix, in a single application, multiple concurrency paradigms and multiple unrelated usages of the same paradigm. This is discussed in the Composability section below.
Any stackless PyPy executable natively supports recursion that is only limited by the available memory. As in normal Python, though, there is an initial recursion limit (which is 5000 in all pypy-c's, and 1000 in CPython). It can be changed with sys.setrecursionlimit(). With a stackless PyPy, any value is acceptable - use sys.maxint for unlimited.
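For example (a minimal sketch; on a stackless PyPy any value is accepted, and sys.maxint is the conventional choice for "unlimited"):

    import sys

    print sys.getrecursionlimit()        # 5000 on a pypy-c, 1000 on CPython
    sys.setrecursionlimit(sys.maxint)    # effectively unlimited (stackless PyPy only)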
A Coroutine is similar to a very small thread, with no preemptive scheduling. Within a family of coroutines, the flow of execution is explicitly transferred from one to another by the programmer. When execution is transferred to a coroutine, it begins to execute some Python code. When it transfers execution away from itself it is temporarily suspended, and when execution returns to it it resumes its execution from the point where it was suspended. Conceptually, only one coroutine is actively running at any given time (but see Composability below).
The stackless.coroutine class is instantiated with no argument. It provides the following methods and attributes:
stackless.coroutine.getcurrent()
Static method returning the currently running coroutine. There is a so-called "main" coroutine object that represents the "outer" execution context, where your main program started and where it runs as long as it does not switch to another coroutine.
coro.bind(callable, *args, **kwds)
Bind the coroutine so that it will execute callable(*args, **kwds). The call is not performed immediately, but only the first time we call the coro.switch() method. A coroutine must be bound before it is switched to. When the coroutine finishes (because the call to the callable returns), the coroutine exits and implicitly switches back to another coroutine (its "parent"); after this point, it is possible to bind it again and switch to it again. (Which coroutine is the parent of which is not documented, as it is likely to change when the interface is refined.)
coro.switch()
Suspend the current (caller) coroutine, and resume execution in the target coroutine coro.
coro.kill()
Kill coro by sending an exception to it. (At the moment, the exception is not visible to app-level, which means that you cannot catch it, and that try: finally: clauses are not honored. This will be fixed in the future.)
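As a minimal illustration of this interface, here is a sketch (assuming a stackless PyPy, so that the stackless module and its coroutine class are available) that binds a coroutine, runs it, and implicitly switches back to the main coroutine when the callable returns:

    from stackless import coroutine

    def greet(name):
        print "hello,", name
        # returning here exits the coroutine and switches back to its parent

    c = coroutine()
    c.bind(greet, "world")
    c.switch()          # runs greet(); execution then resumes here
    print "back in the main coroutine"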
Here is a classical producer/consumer example: an algorithm computes a sequence of values, while another consumes them. For our purposes we assume that the producer can generate several values at once, and the consumer can process up to 3 values in a batch - it can also process batches with fewer than 3 values without waiting for the producer (which would be messy to express with a classical Python generator).
    def producer(lst):
        while True:
            ...compute some more values...
            lst.extend(new_values)
            coro_consumer.switch()

    def consumer(lst):
        while True:
            # First ask the producer for more values if needed
            while len(lst) == 0:
                coro_producer.switch()
            # Process the available values in a batch, but at most 3
            batch = lst[:3]
            del lst[:3]
            ...process batch...

    # Initialize two coroutines with a shared list as argument
    exchangelst = []
    coro_producer = coroutine()
    coro_producer.bind(producer, exchangelst)
    coro_consumer = coroutine()
    coro_consumer.bind(consumer, exchangelst)

    # Start running the consumer coroutine
    coro_consumer.switch()
The stackless module also provides an interface that is roughly compatible with the interface of the stackless module in Stackless Python: it contains stackless.tasklet and stackless.channel classes. Tasklets are also similar to microthreads, but (like coroutines) they don't actually run in parallel with other microthreads; instead, they synchronize and exchange data with each other over Channels, and these exchanges determine which Tasklet runs next.
For usage reference, see the documentation on the Stackless Python website.
Note that Tasklets and Channels are implemented at application-level in pypy/lib/stackless.py on top of coroutines. You can refer to this module for more details and API documentation.
The stackless.py code tries to resemble the stackless C code as much as possible. This makes the code somewhat unpythonic.
Tasklets are a bit like threads: they encapsulate a function in such a way that they can be suspended and restarted at any time. Unlike threads, they won't run concurrently, but must cooperate. When using stackless features, it is vitally important that no action is performed that blocks everything else. In particular, blocking input/output should be centralized in a single tasklet.
Communication between tasklets is done via channels. There are three ways for a tasklet to give up control:

1. call stackless.schedule()
2. create a new tasklet
3. send or receive with a channel
A (live) tasklet can either be running, waiting to get scheduled, or be blocked by a channel.
Scheduling is done in a strictly round-robin manner. A blocked tasklet is removed from the scheduling queue and will be reinserted when it becomes unblocked.
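For example, two tasklets that repeatedly call stackless.schedule() alternate strictly (a sketch assuming a stackless PyPy; the tasklet-creation style mirrors the example below):

    import stackless

    def worker(name):
        for i in range(3):
            print name, i
            stackless.schedule()    # give up control, rejoin the end of the queue

    stackless.tasklet(worker, 'A')()
    stackless.tasklet(worker, 'B')()
    stackless.run()    # should print A 0, B 0, A 1, B 1, A 2, B 2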
Here is a many-producers many-consumers example, where any consumer can process the result of any producer. For this situation we set up a single channel on which all producers send, and on which all consumers wait:
    def producer(chan):
        while True:
            chan.send(...next value...)

    def consumer(chan):
        while True:
            x = chan.receive()
            ...do something with x...

    # Set up the N producer and M consumer tasklets
    common_channel = stackless.channel()
    for i in range(N):
        stackless.tasklet(producer, common_channel)()
    for i in range(M):
        stackless.tasklet(consumer, common_channel)()

    # Run it all
    stackless.run()
Each item sent over the channel is received by one of the waiting consumers; which one is not specified. The producers block until their item is consumed: the channel is not a queue, but rather a meeting point which causes tasklets to block until both a consumer and a producer are ready. In practice, the reason for having several consumers receiving on a single channel is that some of the consumers can be busy in other ways part of the time. For example, each consumer might receive a database request, process it, and send the result to a further channel before it asks for the next request. In this situation, further requests can still be received by other consumers.
A Greenlet is a kind of primitive Tasklet with a lower-level interface and with exact control over the execution order. Greenlets are similar to Coroutines, with a slightly different interface: greenlets put more emphasis on a tree structure. The various greenlets of a program form a precise tree, which fully determines their order of execution.
For usage reference, see the documentation of the Py lib greenlets. The PyPy interface is identical. You should use py.magic.greenlet instead of stackless.greenlet directly, because the Py lib can now give you the latter when you ask for the former on top of PyPy.
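As a quick taste of the switching style (a sketch following the classical Py lib greenlet example):

    import py
    greenlet = py.magic.greenlet

    def test1(x):
        print x            # prints 12
        gr2.switch()       # explicitly transfer control to gr2
        print x + 1        # prints 13 once gr2 switches back

    def test2():
        print 42
        gr1.switch()       # resume test1 just after its gr2.switch()

    gr1 = greenlet(test1)
    gr2 = greenlet(test2)
    gr1.switch(12)         # output: 12, 42, 13

When test1() returns, the dying greenlet switches to its parent, the implicit "main" greenlet, illustrating the tree structure mentioned above.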
PyPy's greenlets do not suffer from the cyclic GC limitation that the CPython greenlets have: greenlets referencing each other via local variables tend to leak on top of CPython (where it is mostly impossible to do the right thing). It works correctly on top of PyPy.
Coroutines and tasklets can be pickled and unpickled, i.e. serialized to a string of bytes for the purpose of storage or transmission. This allows "live" coroutines or tasklets to be made persistent, moved to other machines, or cloned in any way. The standard pickle module works with coroutines and tasklets (at least in a translated pypy-c; unpickling live coroutines or tasklets cannot be easily implemented on top of CPython).
To be able to achieve this result, we have to consider many objects that are not normally picklable in CPython. Here again, the Stackless Python implementation has paved the way, and we follow the same general design decisions: simple internal objects like bound method objects and various kinds of iterators are supported; frame objects can be fully pickled and unpickled (by serializing a reference to the bytecode they are running in addition to all the local variables). References to globals and modules are pickled by name, similarly to references to functions and classes in the traditional CPython pickle.
The "magic" part of this process is the implementation of the unpickling of a chain of frames. The Python interpreter of PyPy uses interpreter-level recursion to represent application-level calls. The reason for this is that it tremendously simplifies the implementation of the interpreter itself. Indeed, in Python, almost any operation can potentially result in a non-tail-recursive call to another Python function. This makes writing a non-recursive interpreter extremely tedious; instead, we rely on lower-level transformations during the translation process to control this recursion. This is the Stackless Transform, which is at the heart of PyPy's support for stackless-style concurrency.
At any point in time, a chain of Python-level frames corresponds to a chain of interpreter-level frames (e.g. C frames in pypy-c), where each single Python-level frame corresponds to one or a few interpreter-level frames - depending on the length of the interpreter-level call chain from one bytecode evaluation loop to the next (recursively invoked) one.
This means that it is not sufficient to simply create a chain of Python frame objects in the heap of a process before we can resume execution of these newly built frames. We must also recreate a corresponding chain of interpreter-level frames. To this end, we have inserted a few named resume points (see 3.2.4, in D07.1 Massive Parallelism and Translation Aspects) in the Python interpreter of PyPy. This is the motivation for implementing the interpreter-level primitives resume_state_create() and resume_state_invoke(), the powerful interface that allows an RPython program to artificially rebuild a chain of calls in a reflective way, completely from scratch, and jump to it.
(See demo/pickle_coroutine.py for the complete source of this demo.)
Consider a program which contains a part performing a long-running computation:
    def ackermann(x, y):
        if x == 0:
            return y + 1
        if y == 0:
            return ackermann(x - 1, 1)
        return ackermann(x - 1, ackermann(x, y - 1))
By using pickling, we can save the state of the computation while it is running, for the purpose of restoring it later and continuing the computation at another time or on a different machine. However, pickling does not produce a whole-program dump: it can only pickle individual coroutines. This means that the computation should be started in its own coroutine:
    # Make a coroutine that will run 'ackermann(3, 8)'
    coro = coroutine()
    coro.bind(ackermann, 3, 8)

    # Now start running the coroutine
    result = coro.switch()
The coroutine itself must switch back to the main program when it needs to be interrupted (we can only pickle suspended coroutines). Due to current limitations this requires an explicit check in the ackermann() function:
    def ackermann(x, y):
        if interrupt_flag:      # test a global flag
            main.switch()       # and switch back to 'main' if it is set
        if x == 0:
            return y + 1
        if y == 0:
            return ackermann(x - 1, 1)
        return ackermann(x - 1, ackermann(x, y - 1))
The global interrupt_flag would be set for example by a timeout, or by a signal handler reacting to Ctrl-C, etc. It causes the coroutine to transfer control back to the main program. The execution comes back just after the line coro.switch(), where we can pickle the coroutine if necessary:
    if not coro.is_alive:
        print "finished; the result is:", result
    else:
        # save the state of the suspended coroutine
        f = open('demo.pickle', 'w')
        pickle.dump(coro, f)
        f.close()
The process can then stop. At any later time, or on another machine, we can reload the file and restart the coroutine with:
    f = open('demo.pickle', 'r')
    coro = pickle.load(f)
    f.close()
    result = coro.switch()
Coroutine pickling is subject to some limitations. First of all, it is not a whole-program "memory dump". This means that only the "local" state of a coroutine is saved. The local state is defined to include the chain of calls and the local variables, but not for example the value of any global variable.
As in normal Python, the pickle will not include any function object's code, any class definition, etc., but only references to functions and classes. Unlike normal Python, the pickle contains frames. A pickled frame stores a bytecode index, representing the current execution position. This means that the user program cannot be modified at all between pickling and unpickling!
On the other hand, the pickled data is fairly independent of the platform and of the PyPy version.
Pickling/unpickling fails if the coroutine is suspended in a state that involves Python frames which were indirectly called. To define this more precisely, a Python function can issue a regular function or method call to invoke another Python function - this is a direct call and can be pickled and unpickled. But there are many ways to invoke a Python function indirectly. For example, most operators can invoke a special method __xyz__() on a class, various built-in functions can call back Python functions, signals can invoke signal handlers, and so on. These cases are not supported yet.
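For instance (a hypothetical illustration, reusing the main coroutine from the demo above), a coroutine suspended inside a callback invoked by a built-in such as map() is in such an unsupported state:

    def callback(x):
        main.switch()     # suspending here leaves an indirectly-called frame on
        return x * 2      # the stack: map() invoked callback(), not Python code

    def computation():
        return map(callback, [1, 2, 3])   # the built-in calls back into Python

Pickling a coroutine suspended at the main.switch() point above would fail, because the callback() frame was not entered through a direct Python-level call.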
In theory, coroutine pickling is general enough to allow coroutines to be cloned within a process; i.e. from one suspended coroutine, a copy can be made - simply by pickling and immediately unpickling it. Both the original and the copy can then continue execution from the same point on. Cloning gives much of the expressive power of full continuations.
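A minimal sketch of cloning by a pickle round-trip (assuming coro is a suspended coroutine all of whose reachable objects are picklable):

    import pickle

    data = pickle.dumps(coro)        # serialize the suspended coroutine
    coro_copy = pickle.loads(data)   # rebuild an independent live copy

    coro.switch()        # the original resumes from the suspension point
    coro_copy.switch()   # the copy resumes from the very same point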
However, pickling has several problems in practice (besides a relatively high overhead). It is not a completely general solution because not all kinds of objects can be pickled; moreover, which objects are pickled by value or by reference only depends on the type of the object. For the purpose of cloning, this means that coroutines cannot be pickled/unpickled in all situations, and even when they can, the user does not have full control over which of the objects currently reachable from a coroutine will be duplicated, and which will be shared with the original coroutine.
For this reason, we implemented a direct cloning operation. After some experiments, we determined that the following behavior is usually considered correct: when cloning a coroutine C, we duplicate exactly those objects that were created by C (i.e. while C was running). The objects not created by C (e.g. pre-existing, or created outside) are not duplicated, but directly shared between C and its new copy. This heuristic generally matches the intuition that the objects created by C are also the ones "owned" by C, in the sense that if C is cloned, the clone needs its own copy - to avoid mutating the same shared object in conflicting ways. Conversely, objects of a more "global" nature, like modules and functions, which are typically created before the coroutine C started, should not be duplicated; this would result in unexpectedly invisible side-effects if they are mutated by the clone of C.
The implementation of the above heuristic is based on support from the garbage collector. For this reason, it is only available if both stackless and our own framework GC are compiled together in pypy-c:
    cd pypy/translator/goal
    python translate.py --stackless --gc=framework
In this mode, our garbage collector is extended to support the notion of "pool": a pool is a linked list of allocated objects. All objects allocated go to a "current" pool. When the stackless module switches execution between two ClonableCoroutine objects, it switches the GC's current pool as well, so that the allocated objects end up segregated by coroutine. Cloning is implemented by another GC extension which makes byte-level copies of allocated objects. Pointers to objects inside the current pool cause the target objects to be recursively copied; pointers outside the current pool are simply shared.
Two interfaces are available to clone a coroutine:
coro.clone()
Clones a suspended coroutine, returning the new copy.
fork()
This global function (from the _stackless module) clones the currently running coroutine, returning the new copy - which, in this case, is a child of the original coroutine (i.e. its parent is the original coroutine). When the parent or any other coroutine eventually switches to the child, execution appears to return from the same fork() call, this time with a None return value.
Forking is a natural way to implement backtracking. Consider a coroutine other than the main one (the main coroutine cannot be cloned or forked) which issues a call to the following function:
    def zero_or_one():
        subcoro = fork()
        if subcoro is not None:
            try:
                subcoro.switch()    # in the parent: run the child first
            except Fail:
                pass
            return 1                # then proceed with answer 1
        else:
            return 0                # in the child: answer 0
It clones the current coroutine, and switches to the new child. In the latter, zero_or_one() returns 0, so the caller sees the value 0 first. But at any later point in time (even from a completely unrelated place, as long as it is still in the same coroutine) the caller may decide that 0 is not a suitable value, and backtrack and try again with the value 1. To do so, it simply raises a Fail exception. The exception stops the child coroutine, which comes back to the parent, where the Fail is caught and the value 1 is returned from zero_or_one().
The tests in pypy/module/_stackless/test/test_clonable.py contain an example using exactly the above function to search for a sequence of zeroes and ones that has a specific property, by trial-and-error. This allows a style that looks similar to what is traditionally reserved for logic programming languages.
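Condensed, the search looks roughly like this (a sketch; check() stands in for whatever application-specific property is being tested, and the whole search must run in a non-main coroutine so that fork() is allowed):

    def find_solution():
        bits = [zero_or_one() for i in range(5)]   # five forking choice points
        if not check(bits):    # 'check' is an assumed application predicate
            raise Fail         # backtrack: kill this clone and retry with 1
        return bits

Each raise Fail unwinds the current clone and resumes the parent of the most recent pending fork, which then tries the answer 1; the tree of choices is thus explored depth-first until check() succeeds.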
Although the concept of coroutines is far from new, they have not been generally integrated into mainstream languages, or only in limited form (like generators in Python and iterators in C#). We can argue that a possible reason for that is that they do not scale well when a program's complexity increases: they look attractive in small examples, but the models that require explicit switching, by naming the target coroutine, do not compose naturally. This means that a program that uses coroutines for two unrelated purposes may run into conflicts caused by unexpected interactions.
To illustrate the problem, consider the following example (simplified code; see the full source in pypy/module/_stackless/test/test_composable_coroutine.py). First, a simple usage of coroutines:
    main_coro = coroutine.getcurrent()    # the main (outer) coroutine
    data = []

    def data_producer():
        for i in range(10):
            # add some numbers to the list 'data' ...
            data.append(i)
            data.append(i * 5)
            data.append(i * 25)
            # and then switch back to main to continue processing
            main_coro.switch()

    producer_coro = coroutine()
    producer_coro.bind(data_producer)

    def grab_next_value():
        if not data:
            # put some more numbers in the 'data' list if needed
            producer_coro.switch()
        # then grab the next value from the list
        return data.pop(0)
Every call to grab_next_value() returns a single value, but if necessary it switches into the producer function (and back) to give it a chance to put some more numbers in the list.
Now consider a simple reimplementation of Python's generators in terms of coroutines:
    def generator(f):
        """Wrap a function 'f' so that it behaves like a generator."""
        def wrappedfunc(*args, **kwds):
            g = generator_iterator()
            g.bind(f, *args, **kwds)
            return g
        return wrappedfunc

    class generator_iterator(coroutine):
        def __iter__(self):
            return self
        def next(self):
            self.caller = coroutine.getcurrent()
            self.switch()
            return self.answer

    def Yield(value):
        """Yield the value from the current generator."""
        g = coroutine.getcurrent()
        g.answer = value
        g.caller.switch()

    def squares(n):
        """Demo generator, producing square numbers."""
        for i in range(n):
            Yield(i * i)
    squares = generator(squares)

    for x in squares(5):
        print x       # this prints 0, 1, 4, 9, 16
Both these examples are attractively elegant. However, they cannot be composed. If we try to write the following generator:
    def grab_values(n):
        for i in range(n):
            Yield(grab_next_value())
    grab_values = generator(grab_values)
then the program does not behave as expected. The reason is the following. The generator coroutine that executes grab_values() calls grab_next_value(), which may switch to the producer_coro coroutine. This works so far, but the switch back from data_producer() to main_coro lands in the wrong coroutine: it resumes execution in the main coroutine, which is not the coroutine it was invoked from. We expect data_producer() to switch back to the grab_next_value() call, but the latter lives in the generator coroutine g created in wrappedfunc, which is totally unknown to the data_producer() code. Instead, we really switch back to the main coroutine, which confuses the generator_iterator.next() method (it gets resumed, but not as a result of a call to Yield()).
As part of trying to combine multiple different paradigms into a single application-level module, we have built a way to solve this problem. The idea is to avoid the notion of a single, global "main" coroutine (or a single main greenlet, or a single main tasklet). Instead, each conceptually separated user of one of these concurrency interfaces can create its own "view" on what the main coroutine/greenlet/tasklet is, which other coroutines/greenlets/tasklets there are, and which of these is the currently running one. Each "view" is orthogonal to the others. In particular, each view has one (and exactly one) "current" coroutine/greenlet/tasklet at any point in time. When the user switches to a coroutine/greenlet/tasklet, it implicitly means that they want to switch away from the current coroutine/greenlet/tasklet that belongs to the same view as the target.
The precise application-level interface has not been fixed yet; so far, "views" in the above sense are objects of the type stackless.usercostate. The above two examples can be rewritten in the following way:
    producer_view = stackless.usercostate()   # a local view
    main_coro = producer_view.getcurrent()    # the main (outer) coroutine
    ...
    producer_coro = producer_view.newcoroutine()
    ...
and:
    generators_view = stackless.usercostate()

    def generator(f):
        def wrappedfunc(*args, **kwds):
            g = generators_view.newcoroutine(generator_iterator)
            ...

    ...generators_view.getcurrent()...
Then the composition grab_values() works as expected, because the two views are independent. The coroutine captured as self.caller in the generator_iterator.next() method is the main coroutine of the generators_view. It is no longer the same object as the main coroutine of the producer_view, so when data_producer() issues the following command:
    main_coro.switch()
the control flow cannot accidentally jump back to generator_iterator.next(). In other words, from the point of view of producer_view, the function grab_next_value() always runs in its main coroutine main_coro and the function data_producer in its coroutine producer_coro. This is the case independently of which generators_view-based coroutine is the current one when grab_next_value() is called.
Only code that has explicit access to the producer_view or its coroutine objects can perform switches that are relevant for the generator code. If the view object and the coroutine objects that share this view are all properly encapsulated inside the generator logic, no external code can accidentally tamper with the expected control flow any longer.
In conclusion: we will probably change the app-level interface of PyPy's stackless module in the future to not expose coroutines and greenlets at all, but only views. They are not much more difficult to use, and they scale automatically to larger programs.