PyPy Parser

WARNING: this document is out-of-date and unfinished. Our compiler is finished by now, contrary to what is described below.

Overview

The PyPy parser includes a tokenizer and a recursive descent parser.

Tokenizer

The tokenizer accepts a string as input and provides tokens through a next() and a peek() method. The tokenizer is implemented as a finite automaton, like lex.
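
For illustration only, here is a minimal sketch of such an interface; the next() and peek() names come from the paragraph above, while everything else (the class name, storing tokens in a list, the context()/restore() pair used for the backtracking described later) is an assumption made for the sketch:

class Source(object):
    """Hypothetical tokenizer front-end exposing next() and peek()."""

    def __init__(self, tokens):
        self.tokens = tokens    # e.g. produced by a lex-like automaton
        self.pos = 0

    def peek(self):
        # Look at the upcoming token without consuming it.
        return self.tokens[self.pos]

    def next(self):
        # Consume and return the upcoming token.
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def context(self):
        # Snapshot of the current position, used to backtrack later.
        return self.pos

    def restore(self, ctx):
        # Undo any consumption done since the matching context() call.
        self.pos = ctx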

Parser

The parser is a tree of grammar rules. EBNF grammar rules are decomposed into a tree of objects. Looking at a grammar rule, one can see that it is composed of a set of four basic sub-rules. In the following example we have all four of them:

S <- A '+' (C | D) +

The previous line says that S is a sequence of the symbol A, the token '+', and a sub-rule with the + multiplicity, which matches one or more occurrences of an alternative between symbol C and symbol D. Thus the four basic grammar rule types are:

  • sequence
  • alternative
  • multiplicity (called Kleene star after the * multiplicity type)
  • token

The four types are each represented by a class in pyparser/grammar.py (Sequence, Alternative, KleeneStar, Token). All classes have a match() method accepting a source (the tokenizer) and a builder (an object responsible for building something out of the grammar).
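
As a rough illustration of what such a match() method could look like for the simplest rule type, here is a hypothetical sketch for Token (only the match(source, builder) signature comes from the text above; the attribute and the builder callback name are made up):

class Token(object):
    def __init__(self, value):
        self.value = value

    def match(self, source, builder):
        # A token rule matches when the upcoming input token is the one
        # we expect; the builder is then told about the consumed token.
        if source.peek() != self.value:
            return False
        builder.token(source.next())    # hypothetical builder callback
        return True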

Here's a basic example and how the grammar is represented:

S <- A ('+'|'-') A
A <- V ( ('*'|'/') V )*
V <- 'x' | 'y'

In Python:

V = Alternative(Token('x'), Token('y'))
A = Sequence(V,
             KleeneStar(
                 Sequence(
                     Alternative(Token('*'), Token('/')),
                     V)))
S = Sequence(A, Alternative(Token('+'), Token('-')), A)

Status

See README.compiling for the status of the parser implementations in PyPy.

Detailed design

Building the Python grammar

The Python grammar is built at startup from the pristine CPython grammar file. The grammar framework is first used to build a simple grammar that parses the grammar file itself. The builder provided to that parser generates another grammar, which is the Python grammar itself. The grammar file should represent an LL(1) grammar. LL(k) should still work, since the parser supports backtracking through the use of source and builder contexts (the Memento pattern, for those who like design patterns).
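
A hedged sketch of that backtracking, for an alternative rule (the context()/restore() calls stand in for whatever the real source and builder contexts are, and the builder callback name is made up):

class Alternative(object):
    def __init__(self, *rules):
        self.rules = rules

    def match(self, source, builder):
        # Try each alternative in turn, undoing any partial progress
        # (memento-style) before moving on to the next one.
        for rule in self.rules:
            source_ctx = source.context()
            builder_ctx = builder.context()
            if rule.match(source, builder):
                builder.alternative(self)   # hypothetical builder callback
                return True
            source.restore(source_ctx)
            builder.restore(builder_ctx)
        return False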

The match function for a sequence is pretty simple:

for each rule in the sequence:
   save the source and builder context
   if the rule doesn't match:
       restore the source and builder context
       return false
call the builder method to build the sequence
return true
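
A direct, hedged transcription of that pseudocode into Python could look like this (the context()/restore() pair and the builder callback name are again assumptions):

class Sequence(object):
    def __init__(self, *rules):
        self.rules = rules

    def match(self, source, builder):
        for rule in self.rules:
            # Save the source and builder context before trying the sub-rule.
            source_ctx = source.context()
            builder_ctx = builder.context()
            if not rule.match(source, builder):
                # Restore the contexts and fail; the caller may backtrack
                # further up if it has an alternative to try.
                source.restore(source_ctx)
                builder.restore(builder_ctx)
                return False
        # Every sub-rule matched: let the builder build the sequence node.
        builder.sequence(self, len(self.rules))    # hypothetical callback
        return True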

Now this really is an LL(0) grammar, since it explores the whole tree of rule possibilities. In fact, there is another member of the rule objects which is built once the grammar is complete: the set of tokens that can begin a match of that rule. Like the grammar, it is precomputed at startup. Each rule then starts with the following test:

if source.peek() not in self.firstset: return False
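
As an illustration of that precomputation (the helper name and the rule attributes below are assumptions, and the handling of empty-matching rules is glossed over), the first sets could be derived recursively from the rule tree:

def calc_firstset(rule):
    # Hypothetical sketch: returns the set of token values that can
    # begin an input matched by the given rule.
    if isinstance(rule, Token):
        return set([rule.value])
    elif isinstance(rule, Alternative):
        # Union of the first sets of all the alternatives.
        firstset = set()
        for sub in rule.rules:
            firstset |= calc_firstset(sub)
        return firstset
    elif isinstance(rule, Sequence):
        # Simplification: assumes the first sub-rule never matches empty
        # input; a real computation would also fold in later sub-rules then.
        return calc_firstset(rule.rules[0])
    elif isinstance(rule, KleeneStar):
        # A starred rule can begin with whatever its inner rule begins with.
        return calc_firstset(rule.rule)
    raise TypeError("unknown rule type: %r" % (rule,))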

Efficiency should be similar to (not much worse than) that of an automaton-based parser, since this approach basically builds an automaton but uses the execution stack to store its parsing state. This also means that recursion in the grammar is directly translated into recursive calls.

Redesigning the parser to remove recursion shouldn't be difficult, but it would make the code less obvious (patches welcome). The basic idea

Parsing

This grammar is then used to parse Python input and transform it into a syntax tree.

As of now, the syntax tree is built as tuples to match the output of the parser module, so that it can be fed to the compiler package.

The compiler package uses the Transformer class to transform this tuple tree into an abstract syntax tree.
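
For comparison, the tuple format and the Transformer-based compiler package come from CPython's (Python 2, since removed) standard library, so the two representations can be inspected directly there; this is only an illustration, not PyPy code:

import parser
import compiler

# Tuple tree in the parser-module format (nested tuples of symbol/token ids).
print parser.expr("x+x*y").totuple()

# Abstract syntax tree produced by the compiler package.
print compiler.parse("x+x*y", mode="eval")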

Sticking to our previous example, the syntax tree for x+x*y would be:

Rule('S', nodes=[
     Rule('A',nodes=[Rule('V', nodes=[Token('x')])]),
     Token('+'),
     Rule('A',nodes=[
        Rule('V', nodes=[Token('x')]),
        Token('*'),
        Rule('V', nodes=[Token('y')])
       ])
     ])

The abstract syntax tree for the same expression would look like:

Add(Var('x'),Mul(Var('x'),Var('y')))

Examples using the parser within PyPy

The four parser variants are used from within interpreter/pycompiler.py

API Quickref

Modules

  • interpreter/pyparser contains the tokenizer and grammar parser
  • interpreter/astcompiler contains the soon-to-be RPythonic compiler
  • interpreter/stablecompiler contains the compiler used by default
  • lib/_stablecompiler contains a modified version of stablecompiler capable of running at application level (that is, not recursively calling itself while compiling)

Main facade functions

still empty, but non-empty for ReST

Grammar

still empty, but non-empty for ReST

Long term goals

Having enough flexibility in the design to allow us to change the grammar at runtime. Such a modification needs to be able to:

  • modify the grammar representation
  • modify the AST builder so it produces existing or new AST nodes
  • add new AST nodes
  • modify the compiler so that it knows how to deal with new AST nodes

parser implementation

We can now build AST trees directly from the parser. Still missing is the ability to easily provide or change the building functions. The functions are referenced at interpreter level through a dictionary mapping that has rule names as keys.
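
A hedged sketch of that mapping (every name below is made up for illustration):

def build_expr(builder, nb):
    # Would combine the last nb items on the builder's stack into one
    # AST node for the 'expr' rule.
    pass

ASTRULES = {
    "expr": build_expr,
    # ... one entry per grammar rule name ...
}

def rule_finished(builder, rulename, nb):
    # When a rule has matched, look up its build function by name and,
    # if one is registered, let it build the corresponding AST node.
    func = ASTRULES.get(rulename)
    if func is not None:
        func(builder, nb)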

compiler implementation

For now, we are working on translating the existing compiler module without changing its design too much. That means we won't have enough flexibility to handle new AST nodes at runtime.

parser module

Enhance the parser module interface so that it allows access to the internal grammar representation, and allows modifying it too.

compiler module

Same as above.