Parsing a language

After the tokenizing phase described in the previous section, a parser needs to be defined to effectively read in expressions in order to manipulate them. The fact that all tokens are in memory makes the process of parsing a little easier.

The module genericparser.ys defines some functions that allow definition of a grammar. The algorithm implemented is a recursive-descent parser. This parser can be used with the tokenizer implemented in the rest of the code (that is, the code maintaining indices into the arrays of tokens).

The code is not concerned about whether the grammar is ambiguous per se. Interaction between the tokenizer and the parser might be necessary for some languages (like C++), if the parser is to read preprocessor directives and template definitions correctly (the tokenizer can temporarily switch to another set of definitions of tokens).

The top-level routine to call is:

MatchGrammar(grammar);

Where a grammar is defined as follows. A grammar is either:

A sentence has the form:

{Tag,grammar_1,grammar_2,...,grammar_n}

The result returned will be a list with Tag as the first element (so that pattern matchers can recognize it later), and one element for each grammar matched.

Grammars can be defined with the DefineGrammar macro. The syntax is:

DefineGrammar(name,grammar);

where name is an atom, and other grammars can refer to this grammar using this name, and grammar is the grammar to recognize.

As a very simple example, consider the following definition of a grammar, defined in examplegrammar1.ys:

Use("codecheck.rep/codetransform.ys");
Use("codecheck.rep/genericparser.ys");

Expr'Tokens :=
{
  {"^[\\s]+",Spaces},
  {"^//.*?\\n",Comments},
  {"^[\\(\\)]",Brackets},
  {"^[0-9]+",Atoms},
  {"^[\\+\\-\\*]",Operators},
  {"^[\\;]",Delimiters},
  {"^[a-zA-Z_][a-zA-Z0-9_]*",Atoms},
  {"^.+",UnParsed}
};
DefineGrammar(statement,
  {
    {Statement,expression,";"}
  });
DefineGrammar(expression,
  {
    {Expr2,term,"+",expression},
    {Expr3,term,"-",expression},
    {Expr4,term}
  });
DefineGrammar(term,
  {
    {Term5,factor,"*",term},
    {Term6,factor},
  });
DefineGrammar(factor,
  {
    {SubExpression,"(",expression,")"},
    {SimpleAtom,AcceptToken(Atoms)}
  });

100 # Expr'Test'TokenMap(_rest) <-- rest;
Expr'Test'PreProcess():= [ ];
Expr'Test'PostProcess():=
[
  Local(nrInputTokens);
  nrInputTokens:=Length(CodeTrans'input);
  tokenIndex:=0;
  NextToken();
  While(tokenIndex < nrInputTokens)
  [
    Local(expression);
    expression := MatchGrammar(statement);
    MakePostFix(GetInputTokens(expression));
  ];
];
10 # MakePostFix({Statement,_expression,";"} ) 
   <--
   [
     ConvertExpression(expression);
     Echo(";");
   ];
10 # ConvertExpression({Expr2,_term,"+",_expression})
   <-- 
[
  ConvertExpression(term);
  Space(1);
  ConvertExpression(expression);
  Space(1);
  WriteString("+");
];
10 # ConvertExpression({Expr3,_term,"-",_expression}) 
   <-- 
[
  ConvertExpression(term);
  Space(1);
  ConvertExpression(expression);
  Space(1);
  WriteString("-");
];
10 # ConvertExpression({Expr4,_term}) <-- ConvertExpression(term);
10 # ConvertExpression({Term5,_factor,"*",_term})
   <-- 
[
  ConvertExpression(factor);
  Space(1);
  ConvertExpression(term);
  Space(1);
  WriteString("*");
];
10 # ConvertExpression({Term6,_factor}) <-- ConvertExpression(factor);
10 # ConvertExpression({SimpleAtom,_atom}) <-- WriteString(atom);
10 # ConvertExpression({SubExpression,"(",_expression,")"}) <-- ConvertExpression(expression);
100 # ConvertExpression(_rest) <-- Echo("UNHANDLED! ",rest);

This code defines a simple grammar with addition, subtraction and multiplication. Statements are expressions that are terminated by a ";". Brackets can be used to define a sub-expression. The code further proceeds to convert the input infix expressions to postfix, as the following example interaction will show:

In> Load("examplegrammar1.ys")
Out> True;
In> FromString("A+B*2;")CodeTransform("Expr","Test")
A B 2 * +;
A+B*2;Out> True;
In> FromString("(A+B)*2;")CodeTransform("Expr","Test")
A B + 2 *;
(A+B)*2;Out> True;

This example also clearly shows how transformation rules would map naturally to the lists returned by the parser. The variable parts of the sentences can just be matched to variables.

The function also outputs the original input, which has not been transformed. This facility would allow converting the code in place.