A foreword of warning about the JIT of PyPy as of March 2007: single functions doing integer arithmetic get great speed-ups; almost anything else runs a bit slower with the JIT than without. We are working on this, and you can expect quick progress, because it is mostly a matter of adding a few careful hints to the source code of the Python interpreter of PyPy.
By construction, the JIT is supposed to work correctly on absolutely any kind of Python code: generators, nested scopes, exec statements, sys._getframe().f_back.f_back.f_locals, etc. (the latter is an example of an expression that Psyco cannot emulate correctly). However, there are a couple of known issues for now (see Caveats).
So far there is little point in trying the JIT on anything other than arithmetic-intensive functions (unless you want to help find bugs). For small examples, you can also look at the machine code it produces, but if you do, please keep in mind that the assembler will look fundamentally different after we extend the range of PyPy that the JIT generator processes.
Go to pypy/translator/goal/ and run:
./translate.py --jit targetpypystandalone
Please read the Status section above first.
This will produce the C code for a version of pypy-c that includes both a regular interpreter and an automatically generated JIT compiler. This pypy-c uses its interpreter by default; due to some overhead, we expect this interpreter to be a bit slower than the one in a pypy-c compiled without the JIT.
In addition to --jit, you can also pass the normal options to translate.py to compile different flavors of PyPy with a JIT. See the compatibility matrix for the combinations known to be working right now. (The combination of the JIT with the thunk or taint object spaces probably works too, but we don't expect it to generate good code before we drop a few extra hints in the source code of the object spaces.)
You can mark one or more code objects as candidates for being run by the JIT as follows:
>>>> def f(x): return x*5
>>>> import pypyjit
>>>> pypyjit.enable(f.func_code)
>>>> f(7)    # the JIT runs here
35
>>>> f(8)    # machine code already generated, no more jitting occurs here
40
A few examples of this kind can be found in demo/jit/. The script demo/jit/f1.py shows a function that becomes seriously faster with the JIT - only 10% to 20% slower than what gcc -O0 produces from the obvious equivalent C code, a result similar to Psyco. Although the JIT generation process is well-tested, we only have a few tests directly for the final pypy-c. Try:
pypy-c test_all.py module/pypyjit/test/test_pypy_c.py -A --nomagic
You can get a dump of the generated machine code by setting the environment variable PYPYJITLOG to a file name before you start pypy-c. See the In more details section above. To inspect this file, use the following tool:
python pypy/jit/codegen/i386/viewcode.py dumpfilename
The viewcode.py script relies on the Linux tool objdump to produce a disassembly. It should be easy to port to OS X. If you want to port the tool to Windows, have a look at http://codespeak.net/svn/psyco/dist/py-utils/xam.py: this is the tool from which viewcode.py was originally derived, but the Windows-specific parts were omitted for lack of a Windows machine to try them on.
When running JIT'ed code, the bytecode tracing hook is not invoked. This should be the only visible effect of the JIT, aside from the debug prints and the speed/memory impact. In practice, of course, it still has some rough edges.
One of them is that, for now, all compile-time errors are fatal, because it is hard to recover from them. Clearly, the compiler is not supposed to fail, but it can happen when memory runs out, a bug is hit, or for more subtle reasons. For example, overflowing the stack is likely to cause the JIT compiler to try to compile the app-level handlers for the RuntimeError; compiling takes up stack space too, so the compiler, running on top of the already-full stack, might hit the stack limit again before it has had a chance to generate code for the app-level handlers.
The central idea of the PyPy JIT is to be completely independent of the Python language. We did not write a JIT compiler by hand; instead, we generate it anew during each translation of pypy-c.
This means that the same technique works out of the box for any other language for which we have an interpreter written in RPython. The technique works on interpreters of any size, from tiny to PyPy.
Aside from the obvious advantage, it means that we can demonstrate all the basic ideas of the technique on a tiny interpreter; the fact that we have done the same on the whole of PyPy shows that the approach scales well. In what follows, we will therefore take small interpreters as examples and insert a JIT compiler into them during translation.
Let's consider a very small interpreter-like example:
def ll_plus_minus(s, x, y):
    acc = x
    pc = 0
    while pc < len(s):
        op = s[pc]
        hint(op, concrete=True)
        if op == '+':
            acc += y
        elif op == '-':
            acc -= y
        pc += 1
    return acc
Here, s is an input program which is simply a string of '+' or '-'. The x and y are integer input arguments. The source code of this example is in pypy/jit/tl/tiny1.py.
Ideally, turning an interpreter into a JIT compiler is only a matter of adding a few hints. In practice, the current JIT generation framework has many limitations and rough edges requiring workarounds. On the above example, though, it works out of the box. We only need one hint, the central hint that all interpreters need. In the source, it is the line:
hint(op, concrete=True)
This hint says: "at this point in time, ensure that op is a compile-time constant". The motivation for such a hint is that the most important source of inefficiency in a small interpreter is the switch on the next opcode, here op. If op is known at the time when the JIT compiler runs, then the whole switch dispatch can be constant-folded away; only the case that applies remains.
The way the concrete=True hint works is by setting a constraint: it requires op to be a compile-time constant. During translation, a phase called hint-annotation processes these hints and tries to satisfy the constraints. Without further hints, the only way that op could be a compile-time constant is if all the other values that op depends on are also compile-time constants. So the hint-annotator will also mark s and pc as compile-time constants.
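To picture the propagation on the example above (a sketch of the reasoning, not actual annotator output):

    op = s[pc]                 # op is computed from s and pc
    hint(op, concrete=True)    # constraint: op must be green (compile-time)
    # The hint-annotator propagates the constraint backwards: for op to
    # be a compile-time constant, s[pc] must be constant-foldable, so
    # both s and pc are marked green as well; the loop condition
    # "pc < len(s)" then involves only greens and runs at compile time.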
You can see the results of the hint-annotator with the following commands:
cd pypy/jit/tl
python ../../translator/goal/translate.py --hintannotate targettiny1.py
Click on ll_plus_minus in the Pygame viewer to get a nicely colored graph of that function. The graph contains the low-level operations produced by RTyping. The green variables are the ones that have been forced to be compile-time constants by the hint-annotator. The red variables are the ones that will generally not be compile-time constants, although the JIT compiler is also able to do constant propagation of red variables if they contain compile-time constants in the first place.
In this example, when the JIT runs, it generates machine code that is simply a list of additions and subtractions, as indicated by the '+' and '-' characters in the input string. To understand why it does this, consider the colored graph of ll_plus_minus more closely. A way to understand this graph is to consider that it is no longer the graph of the ll_plus_minus interpreter, but really the graph of the JIT compiler itself. All operations involving only green variables will be performed by the JIT compiler at compile-time. In this case, the whole looping code only involves the green variables s and pc, so the JIT compiler itself will loop over all the opcodes of the bytecode string s, fetch the characters and do the switch on them. The only operations involving red variables are the int_add and int_sub operations in the implementation of the '+' and '-' opcodes respectively. These are the operations that will be generated as machine code by the JIT.
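To make this concrete, here is what the generated code amounts to for the input string '+-+', shown as Python for readability (the real output is i386 machine code):

    def residual_plus_minus(x, y):   # residual code for s = '+-+'
        acc = x
        acc += y    # generated for '+'
        acc -= y    # generated for '-'
        acc += y    # generated for '+'
        return acc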
Over-simplifying, we can say that at the end of translation, in the actual implementation of the JIT compiler, operations involving only green variables are kept unchanged, and operations involving red variables have been replaced by calls to helpers. These helpers contain the logic to generate a copy of the original operation, as machine code, directly into memory.
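Schematically, the timeshifted ll_plus_minus could be pictured as follows; the helper names are made up for this sketch (the real helpers live in the JIT backends), and "boxes" stand for handles on future runtime values:

    def compile_plus_minus(s, x_box, y_box):
        # green operations: executed directly by the JIT compiler
        acc_box = x_box
        pc = 0
        while pc < len(s):
            op = s[pc]
            if op == '+':
                # red operation replaced by a code-generating helper:
                # emits an int_add into memory and returns a box
                # standing for its future runtime result
                acc_box = genop_int_add(acc_box, y_box)
            elif op == '-':
                acc_box = genop_int_sub(acc_box, y_box)
            pc += 1
        return acc_box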
Now try translating tiny1.py with a JIT without stopping at the hint-annotation viewer:
python ../../translator/goal/translate.py --jit targettiny1.py
Test it:
./targettiny1-c +++-+++ 100 10
150
What occurred here is that the colored graph seen above was turned into a JIT compiler, and the original ll_plus_minus function was patched. Whenever that function is called from the rest of the program (in this case, from entry_point() in pypy/jit/tl/targettiny1.py), then instead of the original code performing the interpretation, the patched function checks whether machine code was already generated for the given bytecode string, invokes the JIT compiler to produce it if not, and finally jumps to the generated machine code.
The idea is that interpreting the same bytecode over and over again with different values of x and y should be the fast path: the compilation step is only required the first time.
On 386-compatible processors running Linux, you can inspect the generated machine code as follows:
PYPYJITLOG=log ./targettiny1-c +++-+++ 100 10
python ../../jit/codegen/i386/viewcode.py log
If you are familiar with GNU-style 386 assembler, you will notice that the code is a single block with no jump, containing the three additions, the subtraction, and the three further additions. The machine code is not particularly optimal in this example because all the values are input arguments of the function, so they are reloaded and stored back in the stack at every operation. The current backend tends to use registers in a (slightly) more reasonable way on more complicated examples.
The interpreter in pypy/jit/tl/tiny2.py is a reasonably good example of the difficulties that we meet when scaling up this approach, and of how we solve them - or work around them. For more details, see the comments in the source code. With more work on the JIT generator, we hope eventually to be able to remove the need for these workarounds.
The most powerful hint introduced in this example is promote=True. It is applied to a value that is usually not a compile-time constant, but which we would like to become a compile-time constant "just in time". Its meaning is to instruct the JIT compiler to stop compiling at this point, wait until the runtime actually reaches that point, grab the value that arrived here at runtime, and go on compiling with the value now considered as a compile-time constant. If the same point is reached at runtime several times with several different values, the compiler will produce one code path for each, with a switch in the generated code. This is a process that is never "finished": in general, new values can always show up later during runtime, causing more code paths to be compiled and the switch in the generated code to be extended.
Promotion is the essential new feature introduced in PyPy when compared to existing partial evaluation techniques (it was actually first introduced in Psyco [JITSPEC], which is strictly speaking not a partial evaluator).
Another way to understand the effect of promotion is to consider it as a complement to the concrete=True hint. The latter tells the hint-annotator that the value arriving at this point is required to be a compile-time constant (i.e. green). This is a very strong constraint, because it forces "backwards" a potentially large number of values to be green as well: all the values that this one depends on. Often it cannot be satisfied at all, because the value ultimately depends on an operation that the JIT compiler cannot constant-fold, e.g. because it depends on external input or reads from non-immutable memory.
The promote=True hint takes an arbitrary red value and returns it as a green variable, so it can be used to limit the set of values that need to be forced green. A common idiom is to put a concrete=True hint at the precise point where a compile-time constant would be useful (e.g. on the value on which a complex switch dispatches), preceded by a few promote=True hints that copy specific values into green variables.
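A sketch of this idiom, for an interpreter where op would otherwise be red (hint returns its first argument, so the promoted value can be reassigned):

    op = hint(op, promote=True)    # red value becomes green "just in time"
    hint(op, concrete=True)        # the switch below dispatches at compile time
    if op == '+':
        ...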
The promote=True hints should be applied at places where we expect not too many different values to arrive at runtime.
The other hints mentioned in pypy/jit/tl/tiny2.py are "global merge points" and "deepfreeze". For more information, please refer to the explanations there.
We should also mention a technique not used in tiny2.py: the notion of virtualizable objects. In PyPy, the Python frame objects are virtualizable. Such objects assume that they will be mostly read and mutated by the JIT'ed code; this is typical of frame objects in most interpreters: they are either not visible at all to the interpreted programs, or (as in Python) you have to access them through some reflection API. The _virtualizable_ hint allows the object to escape (e.g. in PyPy, the Python frame object is pushed on the globally-accessible frame stack) while still remaining efficient to access from JIT'ed code.
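For illustration, a frame class could be marked along these lines; this is a hedged sketch of the marker as used around that time, with a made-up field (see pypy/interpreter/pyframe.py for the real usage):

    class Frame(object):
        # hint: instances are virtualizable - JIT'ed code may keep the
        # fields in registers, materializing the object only if it escapes
        _virtualizable_ = True

        def __init__(self, w_locals):
            self.w_locals = w_locals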
This section is work in progress!
One of the central goals of the PyPy project is to automatically produce a Just-in-Time Compiler from the interpreter, with as little intervention as possible in the interpreter codebase itself. The Just-in-Time Compiler should be just another aspect, introduced as transparently as possible by and during the translation process.
Partial evaluation techniques should, at least theoretically, allow such a derivation of a compiler from an interpreter.
The forest of flow graphs that the translation process generates and transforms constitutes a reasonable basis for the necessary analyses. That is a further reason why having a high-level, runnable and analysable description of the language was always a central tenet of the project.
Transforming an interpreter into a compiler involves constructing a so-called generating extension, which takes programs for the interpreter as input and produces what would be the output of partially evaluating the interpreter with the input program fixed and the input data left as a variable. The generating extension is, in essence, a compiler for the input programs.
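As a toy illustration, here is a sketch of a generating extension for the ll_plus_minus interpreter above; it emits Python source where a real generating extension would emit machine code:

    def gen_plus_minus(s):
        # compile-time loop over the input program
        lines = ['def residual(x, y):', '    acc = x']
        for op in s:
            if op == '+':
                lines.append('    acc += y')
            elif op == '-':
                lines.append('    acc -= y')
        lines.append('    return acc')
        d = {}
        exec '\n'.join(lines) in d     # Python 2 syntax, as in this era
        return d['residual']

    f = gen_plus_minus('+-+')
    assert f(100, 10) == 110    # 100 + 10 - 10 + 10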
Generating extensions can be produced by self-applying partial evaluators, but this approach tends to produce suboptimal results and to scale poorly.
For PyPy, our approach aims at producing the generating extension more directly from the analysed interpreter, in the form of a forest of flow graphs. We call this process timeshifting.
To achieve this, gathering binding-time information is crucial: we must distinguish, in the data-flow of the interpreter, the values that are bound at compile time and immutable at run time from the values that are only known at run time.
Currently we compute binding times by propagating information derived from a few hints inserted in the interpreter. The propagation is implemented by reusing our annotation/type-inference framework.
The code produced by a generating extension for an input program may not be good, especially for a dynamic language, because the input program essentially does not contain enough information to generate good code. What is really desired is not a generating extension doing static compilation, but one capable of dynamic compilation, exploiting runtime information in its result. Compilation should be able to suspend, let the produced code run to collect run-time information (for example, language-level types), and then resume with this extra information. This allows it to generate code optimised for the effective run-time behaviour of the program.
Inspired by Psyco, which is in some sense a hand-written generating extension for Python, we added support for so-called promotion to our framework for producing generating extensions.
Simply put, promotion on a value stops compilation and waits until the runtime reaches this point. When it does, the actual runtime value is promoted into a compile-time constant, and compilation resumes with this extra information. Concretely, the promotion expands into a switch in the generated code. The switch contains one case for each runtime value encountered so far, to choose which specialized code path the runtime execution should follow. The switch initially contains no case at all, but only a fall-back path. The fall-back invokes the compiler with the newly seen runtime value; the compiler produces the new specialized code path and then patches the switch to add this new case, so that the next time the same value is encountered at runtime, the execution can jump directly to the correct specialized code path.
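In pseudo-code, the switch produced by a promotion point behaves like this (the names are invented for the sketch; the real switch is patched machine code, not a dictionary):

    compiled_cases = {}    # runtime value -> specialized code path

    def promotion_switch(value):
        if value not in compiled_cases:
            # fall-back path: invoke the compiler with the newly seen
            # value, then "patch the switch" by extending the cases
            compiled_cases[value] = compile_path_for(value)
        return compiled_cases[value]    # jump to the specialized code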
This can also be thought of as a generalisation of polymorphic inline caches.
Partial evaluation is the process of evaluating a function, say f(x, y), with only partial information about the value of its arguments, say the value of the x argument only. This produces a residual function g(y), which takes fewer arguments than the original: only the information not specified during the partial evaluation process needs to be provided to the residual function, in this example the y argument.
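For example (a toy sketch): specializing f below for x = 3 produces a residual function g of y only, in which everything that depends solely on x has been evaluated away:

    def f(x, y):
        return x * x + y

    def specialize_f(x):
        x_squared = x * x        # evaluated at specialization time
        def g(y):                # residual function: only y remains
            return x_squared + y
        return g

    g = specialize_f(3)
    assert g(4) == f(3, 4) == 13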
Partial evaluation (PE) comes in two flavors:

- On-line PE: the partial evaluator propagates and folds the known values while it processes the program, deciding on the fly what can be computed at specialization time.
- Off-line PE: a separate binding-time analysis is performed in advance to annotate each value as compile-time or run-time; a specializer then processes the program following these annotations.
Off-line partial evaluation is based on binding-time analysis, which is the process of determining among the variables used in a function (or a set of functions) which ones are going to be known in advance and which ones are not. In the above example, such an analysis would be able to infer that the constantness of the argument x implies the constantness of many intermediate values used in the function. The binding time of a variable determines how early the value of the variable will be known.
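Continuing the f(x, y) example, a binding-time analysis would color the intermediate values like this:

    def f(x, y):
        a = x * x    # compile time ("green"): depends only on x
        b = a + y    # run time ("red"): depends on y
        return b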
The PyPy JIT is generated using off-line partial evaluation. As such, there are three distinct phases:

- Translation time: the hint-annotation (our binding-time analysis) runs on the flow graphs of the interpreter, and timeshifting turns them into a JIT compiler, which is compiled into pypy-c.
- Compile time: the generated JIT compiler runs, just in time, on the user's program and produces machine code.
- Run time: the generated machine code runs on the actual data.
The binding-time terminology that we are using in PyPy is based on the colors that we use when displaying the control flow graphs:

- Green variables contain values that are known at compile time ("compile-time constants"); operations involving only green variables are executed by the JIT compiler itself.
- Red variables contain values that are generally only known at run time; operations involving red variables are turned into machine code by the JIT compiler.
The expanded version of the present document may be of interest to you if you are already familiar with the domain of Partial Evaluation and are looking for a quick overview of some of our techniques.
[JITSPEC] Representation-Based Just-In-Time Specialization and the Psyco Prototype for Python. ACM SIGPLAN PEPM'04, August 24-26, 2004, Verona, Italy. http://psyco.sourceforge.net/psyco-pepm-a.ps.gz