Sather Runtime Model


Programmer's Interface

In this section, we describe what ANTLR generates after reading your grammar file and how to use that output to parse input. The classes from which your lexer, token, and parser classes are derived are provided as well.

What ANTLR generates

ANTLR generates the following types of files, where MY_PARSER, MY_LEXER, and MY_TREE_PARSER are names of grammar classes specified in the grammar file. You may have an arbitrary number of parsers, lexers, and tree-parsers per grammar file; a separate class file will be generated for each. In addition, token type files will be generated containing the token vocabularies used in the parsers and lexers. One or more token vocabularies may be defined in a grammar file, and shared between different grammars. For example, given the grammar file:

options {
  language="Sather";
}

class MY_PARSER extends Parser;
options {
  exportVocab=MY;
}
... rules ...

class MY_LEXER extends Lexer;
options {
  exportVocab=MY;
}
... rules ...

class MY_TREE_PARSER extends TreeParser;
options {
  exportVocab=MY;
}
... rules ...

The following files will be generated:

  • MY_PARSER.sa. The parser with member methods for the parser rules.
  • MY_LEXER.sa. The lexer with member methods for the lexical rules.
  • MY_TREE_PARSER.sa. The tree-parser with member methods for the tree-parser rules.
  • MY_TOKENTYPES.sa. A class containing all of the token types defined by your parsers and lexers using the exported vocabulary named MY.
  • MY_TokenTypes.txt. A text file containing all of the token types, literals, and paraphrases defined by parsers and lexers contributing to vocabulary MY.
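
For illustration, a MY_TokenTypes.txt file might look like the following (a sketch; the token names and numbers are illustrative, and ANTLR 2 reserves the token type numbers below 4 for internal use):

  MY // exported token vocabulary
  PLUS=4
  MINUS=5
  INT=6
  ID=7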

The programmer uses the classes by referring to them:

  1. Create a lexical analyzer.
  2. Create a parser and attach it to the lexer (or another $ANTLR_TOKEN_STREAM).
  3. Call one of the methods in the parser to begin parsing.

If your parser generates an AST, then get the AST value, create a tree-parser, and invoke one of the tree-parser rules using the AST.

lex ::= #MY_LEXER{ANTLR_COMMON_TOKEN}( file_stream );
parser ::= #MY_PARSER{ANTLR_COMMON_TOKEN,ANTLR_COMMON_AST}( lex, user-defined-args-if-any );
parser.start-rule;
-- and, if you are tree parsing the result...
tree_parser ::= #MY_TREE_PARSER{ANTLR_COMMON_AST};
tree_parser.start-rule( parser.ast );

The lexer and parser can raise exceptions of type $ANTLR_RECOGNITION_EXCEPTION, which you may catch:

  lexer ::= #CALC_LEXER{ANTLR_COMMON_TOKEN}( file_stream );
  parser ::= #CALC_PARSER{ANTLR_COMMON_TOKEN,ANTLR_COMMON_AST}(lexer);
  -- Parse the input expression
  protect 
    parser.expr;
  when $ANTLR_RECOGNITION_EXCEPTION then
    #ERR + exception.str + "\n";
  end;

Multiple Lexers/Parsers With Shared Input State

Occasionally, you will want two parsers or two lexers to share input state; that is, you will want them to pull input from the same source token stream or character stream.  

ANTLR 2.6.0 factored the input variables such as line number, guessing state, and input stream into a separate object so that another lexer or parser could share that state. The classes ANTLR_LEXER_SHARED_INPUT_STATE and ANTLR_PARSER_SHARED_INPUT_STATE embody this factoring. The attribute input_state can be used on either ANTLR_CHAR_SCANNER or ANTLR_PARSER objects. Here is how to construct two lexers sharing the same input stream:

-- create a Java lexer
main_lexer ::= #JAVA_LEXER{ANTLR_COMMON_TOKEN}( input );
-- create javadoc lexer
-- attach to shared input state of java lexer
doclexer ::= #JAVADOC_LEXER{ANTLR_COMMON_TOKEN}( main_lexer.input_state );

Parsers with shared input state can be created similarly:

jdocparser ::=  #JAVA_DOC_PARSER{ANTLR_COMMON_TOKEN,ANTLR_COMMON_AST}( input_state );
jdocparser.content; -- go parse the comment

Sharing state is easy, but what happens when an exception is thrown during the execution of the "subparser"? What about syntactic predicate execution? It turns out that invoking a subparser with the same input state is exactly the same as calling another rule in the same parser as far as error handling and syntactic predicate guessing are concerned. If the parser is guessing before the call to the subparser, the subparser must continue guessing, right? Exceptions thrown inside the subparser exit the subparser and return to the enclosing error handler or syntactic predicate handler.
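
For example, wrapping the subparser invocation in a protect behaves just like protecting a rule of the invoking parser (a sketch built from the constructs shown above):

  jdocparser ::= #JAVA_DOC_PARSER{ANTLR_COMMON_TOKEN,ANTLR_COMMON_AST}( input_state );
  protect 
    -- errors inside the subparser propagate to this handler,
    -- exactly as if content were a rule of the invoking parser
    jdocparser.content;
  when $ANTLR_RECOGNITION_EXCEPTION then
    report_error( exception.str );
  end;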

Parser Implementation

Parser Class

ANTLR generates a parser class (an extension of ANTLR_LLKPARSER) that contains a method for every rule in your grammar. The general format looks like:

class MY_PARSER{ TOKEN < $ANTLR_TOKEN, AST < $ANTLR_AST{AST} } is

  include ANTLR_LLKPARSER{ TOKEN, AST } create -> super_create;
  include MY_PARSER_TOKENTYPES;

  create ( token_buf : ANTLR_TOKEN_BUFFER{TOKEN} , k : INT ) : SAME is
    res : SAME := super_create( token_buf, k );
    res.token_names := sa_token_names;
    return res;
  end;

  create ( token_buf : ANTLR_TOKEN_BUFFER{TOKEN} ) : SAME is
    return #SAME( token_buf, 1);
  end;

  create ( lexer : $ANTLR_TOKEN_STREAM{TOKEN} , k : INT ) : SAME is
    res : SAME := super_create( lexer, k );
    res.token_names := sa_token_names;
    return res;
  end;

  create( lexer : $ANTLR_TOKEN_STREAM{TOKEN} ) : SAME is
    res : SAME := #SAME( lexer, 1);
    return res;
  end;

  create ( state : ANTLR_PARSER_SHARED_INPUT_STATE{TOKEN} ) : SAME is 
    res : SAME := super_create( state,1);
    res.token_names := sa_token_names;
    return res;
  end;
  ...
  -- add your own constructors here...
  rule-definitions
end;

Parser Methods

ANTLR generates recursive-descent parsers; therefore, every rule in the grammar results in a method that applies the specified grammatical structure to the input token stream. The general form of a parser method looks like:

rule is
  init-action-if-present
  if ( lookahead-predicts-production-1 ) then
     code-to-match-production-1
  elsif ( lookahead-predicts-production-2 ) then
     code-to-match-production-2
  ...
  elsif ( lookahead-predicts-production-n ) then
     code-to-match-production-n
  else 
    -- syntax error
    raise #ANTLR_NO_VIABLE_ALT_EXCEPTION(LT(1));
  end;
end;
This code results from a rule of the form:
rule:   production-1
    |   production-2
   ...
    |   production-n
    ;

If you have specified arguments and a return type for the rule, the method header changes to:

(* generated from:
 *    rule(user-defined-args)
 *      returns return-type : ... ;
 *)
rule( user-defined-args ) : return-type is
  ...
end;

Token types are integers, and we make heavy use of sets and range comparisons to avoid excessively long test expressions.
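
For instance, depending on the shape of the lookahead set, a generated test might take any of the following forms (a sketch; the token names are illustrative):

  if ( LA(1) = KEY_BEGIN ) then ...                        -- singleton set
  if ( LA(1) >= KEY_INT and LA(1) <= KEY_FLOAT ) then ...  -- contiguous range
  if ( token_set1.member(LA(1)) ) then ...                 -- general set test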

EBNF Subrules

Subrules are like unlabeled rules; consequently, the code generated for an EBNF subrule mirrors that generated for a rule. The only difference is induced by the EBNF subrule operators that imply optionality or looping.

(...)? optional subrule. The only difference between the code generated for an optional subrule and that for a rule is that there is no default else-clause to raise an exception; recognition simply continues, having ignored the optional subrule.

  init-action-if-present
  if ( lookahead-predicts-production-1 ) then
     code-to-match-production-1

  elsif ( lookahead-predicts-production-2 ) then
     code-to-match-production-2

  ...
  elsif ( lookahead-predicts-production-n ) then
     code-to-match-production-n
  end;

Not testing the optional paths of optional blocks has the potential to delay the detection of syntax errors.

(...)* closure subrule. A closure subrule is like an optional looping subrule; therefore, we wrap the code for a simple subrule in a "forever" loop that exits whenever the lookahead is not consistent with any of the alternative productions.

  init-action-if-present
  loop
    if ( lookahead-predicts-production-1 ) then
       code-to-match-production-1

    elsif ( lookahead-predicts-production-2 ) then
       code-to-match-production-2

    ...
    elsif ( lookahead-predicts-production-n ) then
       code-to-match-production-n

    else 
      break!;

    end;
  end;

While there is no need to explicitly test the lookahead for consistency with the exit path, the grammar analysis phase computes the lookahead of what follows the block. The lookahead of what follows must be disjoint from the lookahead of each alternative; otherwise, the loop will not know when to terminate. For example, consider the following subrule, which is nondeterministic upon token A.

( A | B )* A

Upon A, should the loop continue or exit? One must also ask if the loop should even begin. Because you cannot answer these questions with only one symbol of lookahead, the decision is non-LL(1).
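To see the conflict concretely, consider the loop ANTLR would have to generate (a sketch): upon A, the first alternative and the exit path are both viable.

  loop
    if ( LA(1) = A ) then     -- but A also begins the exit path!
      match(A);
    elsif ( LA(1) = B ) then
      match(B);
    else 
      break!;
    end;
  end;
  match(A);  -- never reached if the loop consumed the final A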

Not testing the exit paths of closure loops has the potential to delay the detection of syntax errors.

As a special case, a closure subrule with one alternative production results in:

  init-action-if-present
  loop while!( lookahead-predicts-production-1 );
    code-to-match-production-1
  end;
 

This special case results in smaller, faster, and more readable code.
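
For instance, a subrule such as ( ID )* might emit nothing more than (a sketch):

  loop while!( LA(1) = ID );
    match(ID);
  end;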

(...)+ positive closure subrule. A positive closure subrule is a loop around a series of production prediction tests like a closure subrule. However, we must guarantee that at least one iteration of the loop is done before proceeding to the construct beyond the subrule.

  sa_cnt : INT := 0;
  init-action-if-present
  loop
    if ( lookahead-predicts-production-1 ) then
       code-to-match-production-1

    elsif ( lookahead-predicts-production-2 ) then
       code-to-match-production-2

    ...
    elsif ( lookahead-predicts-production-n ) then
       code-to-match-production-n

    elsif ( sa_cnt >= 1 ) then
      -- lookahead predicted nothing and we've
      -- done an iteration
      break!;

    else 
      raise #ANTLR_NO_VIABLE_ALT_EXCEPTION(LT(1));

    end;
    sa_cnt := sa_cnt + 1;  -- track times through the loop

  end;

While there is no need to explicitly test the lookahead for consistency with the exit path, the grammar analysis phase computes the lookahead of what follows the block. The lookahead of what follows must be disjoint from the lookahead of each alternative; otherwise, the loop will not know when to terminate. For example, consider the following subrule, which is nondeterministic upon token A.

( A | B )+ A

Upon A, should the loop continue or exit? Because you cannot answer this with only one symbol of lookahead, the decision is non-LL(1).

Not testing the exit paths of closure loops has the potential to delay the detection of syntax errors.

You might ask why we do not use a while loop that tests whether the lookahead is consistent with any of the alternatives (rather than a series of tests inside the loop with a break). It turns out that we can generate smaller code for a series of tests than for one big test. Moreover, the individual tests must be done anyway to distinguish between alternatives, so a while condition would be redundant.

As a special case, if there is only one alternative, the following is generated:

  init-action-if-present
  loop
    code-to-match-production-1
    if ( ~( lookahead-predicts-production-1 ) ) then
      break!;
    end;
  end;

Optimization. When there is a large number of strictly LL(1) prediction alternatives (where "large" is user-definable), a case-statement can be used rather than a sequence of if-statements. The non-LL(1) cases are handled by generating the usual if-statements in the else clause. For example:

case ( LA(1) ) 
  when KEY_WHILE, KEY_IF, KEY_DO then
    statement;
  when KEY_INT, KEY_FLOAT then
    declaration;
  else 
    -- do whatever else-clause is appropriate
end;

This optimization relies on the compiler building a more direct jump (via a jump table or hash table) to the code matching the ith production. This is also more readable and faster than a series of set membership tests.

Production Prediction

LL(1) prediction. Any LL(1) prediction test is a simple set membership test. If the set is a singleton (a set with only one element), an integer token-type equality comparison is done instead. If the set has more than one element, a set is created and the single input token type is tested for membership against that set. For example, consider the following rule:

a : A | b ;
b : B | C | D | E | F;

The lookahead that predicts production one is {A} and the lookahead that predicts production two is {B,C,D,E,F}. The following code would be generated by ANTLR for rule a (slightly cleaned up for clarity):

a is 
  if ( LA(1) = A ) then
    match(A);

  elsif ( token_set1.member(LA(1)) ) then
    b;

  else 
    raise #ANTLR_NO_VIABLE_ALT_EXCEPTION(LT(1));

  end;

end;

The prediction for the first production can be done with a simple integer comparison, but the second alternative uses a set membership test for speed; note that the set test is not obviously a test of LA(1) for membership in {B,C,D,E,F}. The complexity threshold above which set tests are generated is user-definable. We use arrays of BOOLs to hold sets. The various sets needed by ANTLR are created and initialized in the generated parser (or lexer) class.

Approximate LL(k) prediction. This is an extension of LL(1): we do a series of up to k set tests rather than a single test as in LL(1) prediction. Each decision will use a different amount of lookahead, with LL(1) being the dominant decision type.
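
For illustration, a decision needing two symbols of lookahead might be rendered as a conjunction of set tests, one per lookahead depth (a sketch; the sets are hypothetical):

  if ( token_set1.member(LA(1)) and token_set2.member(LA(2)) ) then
     code-to-match-production-1
  elsif ( token_set3.member(LA(1)) and token_set4.member(LA(2)) ) then
     code-to-match-production-2
  ...
  end;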

Production Element Recognition

Token references. Token references are translated to:

match(token-type);

For example, a reference to token KEY_BEGIN results in:

match(KEY_BEGIN);

where KEY_BEGIN will be an integer constant defined in the MY_PARSER_TOKENTYPES class generated by ANTLR.

String literal references. String literal references are references to automatically generated tokens to which ANTLR automatically assigns a token type (one for each unique string). String references are translated to:

match(T);

where T is the token type assigned by ANTLR to that token.

Character literal references. Referencing a character literal implies that the current rule is a lexical rule. Single characters, 't', are translated to:

match('t');

which can be manually inlined with:

  if ( c = 't' ) then 
    consume;
  else 
    raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line );
  end;

if the method call proves slow (at the cost of space).

Wildcard references. In lexical rules, the wildcard is translated to:

  consume;

which simply gets the next character of input without doing a test.

A reference to the wildcard in a parser rule results in the same thing, except that the consume call is with respect to the parser.

Not operator. When operating on a token, ~T is translated to:

match_not( T );

When operating on a character literal, ~'t' is translated to:

match_not( 't' );

Range operator. In parser rules, the range operator (T1..T2) is translated to:

match_range( T1, T2 );

In a lexical rule, the range operator for characters c1..c2 is translated to:

match_range( c1, c2 );

Labels. Element labels on atom references become TOKEN references in parser rules and CHARs in lexical rules. For example, the parser rule:

a : id:ID { OUT::create + "id is " + id + '\n'; }
  ;
would be translated to:
a is
  id : TOKEN := void;
  id := LT(1);
  match(ID);
  OUT::create + "id is " + id + '\n';
end;
For lexical rules such as:
ID : w:. { OUT::create + "w is "+ w + '\n'; }
   ;
the following code would result:
ID is
  w : CHAR;
  w := c;
  consume; -- match wildcard (anything)
  OUT::create + "w is "+ w + '\n';
end;

When generating trees, labels on rule references result in AST references of the form label_ast.
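
For example, with tree construction turned on, a labeled rule reference such as e:expr might translate to something like the following sketch (the exact attribute holding the returned AST is an assumption here):

  expr;
  e_ast := ast;  -- sketch: bind the label to the AST built by expr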

Rule references. Rule references become method calls. Arguments to rules become arguments to the invoked methods. Return values are assigned with ordinary Sather assignments. Consider the rule reference i=list[1] to rule:

list[scope:INT] returns INT
    :   { return scope+3; }
    ;
The rule reference would be translated to:
i := list(1);

Semantic actions. Actions are translated verbatim to the output parser or lexer except for the translations required for AST generation.

To add members to a lexer or parser class definition, add the class member definitions enclosed in {} immediately following the class specification, for example:

class MY_PARSER extends Parser;
{
   private attr i : INT;
   create ( lexer : $ANTLR_TOKEN_STREAM{TOKEN}, a_useful_argument : INT ) : SAME is
      res : SAME := #SAME( lexer );
      res.i := a_useful_argument;
      return res;
   end;
}
... rules ...

ANTLR collects everything inside the {...} and inserts it in the class definition before the rule-method definitions.

Semantic predicates. Validating semantic predicates become runtime checks that raise a recognition error when the predicate evaluates to false; disambiguating semantic predicates are hoisted into the production prediction expressions alongside the lookahead tests.
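
As a rough sketch, assuming the Sather runtime mirrors the Java runtime's SemanticException with a class named ANTLR_SEMANTIC_EXCEPTION (an assumption, as is the helper is_type_name), a validating predicate such as {is_type_name(LT(1).text)}? might translate to:

  -- ANTLR_SEMANTIC_EXCEPTION and is_type_name are assumed names;
  -- the predicate text becomes the exception message
  if ( ~ is_type_name(LT(1).text) ) then
    raise #ANTLR_SEMANTIC_EXCEPTION( "is_type_name(LT(1).text)" );
  end;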

Lexer Implementation

Lexer Form

The lexers produced by ANTLR 2.x are a lot like the parsers produced by ANTLR 2.x. The only major differences are that (a) scanners use characters instead of tokens, and (b) ANTLR generates a special next_token rule for each scanner, which is a production containing each public lexer rule as an alternative. The lexical grammar class provided by the programmer results in a subclass of ANTLR_CHAR_SCANNER, for example:

class MY_LEXER{ TOKEN < $ANTLR_TOKEN } < $ANTLR_TOKEN_STREAM{TOKEN} , $ANTLR_FILE_CURSOR is

  include ANTLR_CHAR_SCANNER{TOKEN} create -> private char_scanner_create;
  include MY_LEXER_TOKENTYPES;

  create ( istr : $ISTREAM ) : SAME is
    ...
  end;

  create ( bb : ANTLR_BYTE_BUFFER ) : SAME is
    ...
  end;

  create ( state : ANTLR_LEXER_SHARED_INPUT_STATE ) : SAME is 
    ...
  end;

  next_token : TOKEN is
     scanning logic
    ...
  end;
  recursive and other non-inlined lexical methods
  ...

end;

When an ANTLR-generated parser needs another token from its lexer, it calls a method called next_token. The general form of the next_token method is:

next_token : TOKEN is
  sa_ttype : INT;
  loop
     protect 
        reset_text;
        case ( LA(1) ) 
        case for each char predicting lexical rule
           call lexical rule gets token type -> sa_ttype
        else 
           raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line );
        end;

        if ( sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then 
           return make_token( sa_ttype );
        end;

     when $ANTLR_RECOGNITION_EXCEPTION then
        report_error( exception.str );

     end;
  end;
end;

For example, the lexical rules:

class LEX extends Lexer;

WS   : ('\t' | '\r' | ' ') {sa_ttype := ANTLR_COMMON_TOKEN::SKIP;} ;
PLUS : '+';
MINUS: '-';
INT  : ( '0'..'9' )+ ;
ID   : ( 'a'..'z' )+ ;
UID  : ( 'A'..'Z' )+ ;
would result in something like:
class LEX{ TOKEN < $ANTLR_TOKEN } < $ANTLR_TOKEN_STREAM{TOKEN} , $ANTLR_FILE_CURSOR is 
	
   next_token : TOKEN is
      sa_rettoken : TOKEN;
      continue : BOOL := true;
      loop
	 sa_ttype : INT := ANTLR_COMMON_TOKEN::INVALID_TYPE;
	 reset_text;
	 protect		-- for char stream error handling
	    protect		-- for lexical error handling
	       case ( LA(1) )
	       when '\t'  , '\r'  , ' '
	       then
		  mWS( true );
		  sa_rettoken := sa_return_token;
	       when '+'
	       then
		  mPLUS( true );
		  sa_rettoken := sa_return_token;
	       when '-'
	       then
		  mMINUS( true );
		  sa_rettoken := sa_return_token;
	       when '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
	       then
		  mINT( true );
		  sa_rettoken := sa_return_token;
	       when 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
			'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'  then
		  mID( true );
		  sa_rettoken := sa_return_token;
	       when 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 
			'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'  then
		  mUID( true );
		  sa_rettoken := sa_return_token;
	       else		-- default
		  if ( LA(1) = EOF_CHAR ) then 
		     upon_eof; 
		     sa_return_token := make_token( ANTLR_COMMON_TOKEN::EOF_TYPE);
		  else 
		     raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line ); 
		  end; -- if
	       end; -- case
          
	       if ( ~void(sa_return_token) and continue ) then
		  sa_ttype := sa_return_token.ttype;
		  sa_ttype := test_literals_table(sa_ttype);
		  sa_return_token.ttype := sa_ttype;
		  return sa_return_token;
	       end; -- if
	    when $ANTLR_RECOGNITION_EXCEPTION then
	       report_error( exception );
	       consume;
	    end; -- protect
	 when $ANTLR_CHAR_STREAM_EXCEPTION then
	    raise #ANTLR_TOKEN_STREAM_EXCEPTION( exception.message );
	 end; -- protect
      end; -- loop
   end; -- next_token
  
   mWS( sa_create_token : BOOL ) is
      sa_ttype : INT; 
      sa_token : TOKEN; 
      sa_begin : INT := text.length;
      sa_ttype := WS;
      sa_save_index : INT;
    
      case ( LA(1) )
      when '\t'
      then
	 match('\t');
      when '\r'
      then
	 match('\r');
      when ' '
      then
	 match(' ');
      else
	 raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line );
      end; -- case
    
      sa_ttype := ANTLR_COMMON_TOKEN::SKIP;
      if ( sa_create_token and void(sa_token) and sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then
	 sa_token := make_token( sa_ttype );
	 sa_token.text := text.substring( sa_begin, text.length - sa_begin );
      end; -- if
      sa_return_token := sa_token;
   end; -- rule
  
   mPLUS( sa_create_token : BOOL ) is
      sa_ttype : INT; 
      sa_token : TOKEN; 
      sa_begin : INT := text.length;
      sa_ttype := PLUS;
      sa_save_index : INT;
    
      match('+');
      if ( sa_create_token and void(sa_token) and sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then
	 sa_token := make_token( sa_ttype );
	 sa_token.text := text.substring( sa_begin, text.length - sa_begin );
      end; -- if
      sa_return_token := sa_token;
   end; -- rule
  
   mMINUS( sa_create_token : BOOL ) is
      sa_ttype : INT; sa_token : TOKEN; sa_begin : INT := text.length;
      sa_ttype := MINUS;
      sa_save_index : INT;
    
      match('-');
      if ( sa_create_token and void(sa_token) and sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then
	 sa_token := make_token( sa_ttype );
	 sa_token.text := text.substring( sa_begin, text.length - sa_begin );
      end; -- if
      sa_return_token := sa_token;
   end; -- rule
  
   mINT( sa_create_token : BOOL ) is
      sa_ttype : INT;
      sa_token : TOKEN;
      sa_begin : INT := text.length;
      sa_ttype := INT;
      sa_save_index : INT;
    
      sa0_cnt7 : INT := 0;
      loop
	 if (((LA(1) >= '0' and LA(1) <= '9'))) then
	    match_range( '0', '9' );
	 else
	    if ( sa0_cnt7 >= 1 ) then 
	       break! 
	    else 
	       raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line ); 
	    end; -- if
	 end; -- if
      
	 sa0_cnt7 := sa0_cnt7 + 1;
      end; -- loop
      if ( sa_create_token and void(sa_token) and sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then
	 sa_token := make_token( sa_ttype );
	 sa_token.text := text.substring( sa_begin, text.length - sa_begin );
      end; -- if
      sa_return_token := sa_token;
   end; -- rule
  
   mID( sa_create_token : BOOL ) is
      sa_ttype : INT; sa_token : TOKEN; sa_begin : INT := text.length;
      sa_ttype := ID;
      sa_save_index : INT;
    
      sa1_cnt10 : INT := 0;
      loop
	 if (((LA(1) >= 'a' and LA(1) <= 'z'))) then
	    match_range( 'a', 'z' );
	 else
	    if ( sa1_cnt10 >= 1 ) then 
	       break! 
	    else 
	       raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line ); 
	    end; -- if
	 end; -- if
      
	 sa1_cnt10 := sa1_cnt10 + 1;
      end; -- loop
      if ( sa_create_token and void(sa_token) and sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then
	 sa_token := make_token( sa_ttype );
	 sa_token.text := text.substring( sa_begin, text.length - sa_begin );
      end; -- if
      sa_return_token := sa_token;
   end; -- rule
  
   mUID( sa_create_token : BOOL ) is
      sa_ttype : INT; sa_token : TOKEN; sa_begin : INT := text.length;
      sa_ttype := UID;
      sa_save_index : INT;
    
      sa2_cnt13 : INT := 0;
      loop
	 if (((LA(1) >= 'A' and LA(1) <= 'Z'))) then
	    match_range( 'A', 'Z' );
	 else
	    if ( sa2_cnt13 >= 1 ) then 
	       break! 
	    else 
	       raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( LA(1), file_name, line ); 
	    end; -- if
	 end; -- if
      
	 sa2_cnt13 := sa2_cnt13 + 1;
      end; -- loop
      if ( sa_create_token and void(sa_token) and sa_ttype /= ANTLR_COMMON_TOKEN::SKIP ) then
	 sa_token := make_token( sa_ttype );
	 sa_token.text := text.substring( sa_begin, text.length - sa_begin );
      end; -- if
      sa_return_token := sa_token;
   end; -- rule
  
end; -- class

ANTLR-generated lexers assume that you will be reading streams of characters. If this is not the case, you must create your own lexer.

Creating Your Own Lexer

To create your own lexer, the Sather class that will do the lexing must implement the abstract class $ANTLR_TOKEN_STREAM, which simply states that you must be able to return a stream of tokens conforming to $ANTLR_TOKEN via next_token:

abstract class $ANTLR_TOKEN_STREAM{TOKEN < $ANTLR_TOKEN} is
  next_token : TOKEN;
end;

ANTLR will not generate a lexer if you do not specify a lexical class.
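
A hand-built lexer might look like the following skeleton (a sketch; HAND_BUILT_LEXER and its internals are hypothetical):

class HAND_BUILT_LEXER{ TOKEN < $ANTLR_TOKEN } < $ANTLR_TOKEN_STREAM{TOKEN} is

  create ( istr : $ISTREAM ) : SAME is
    remember the input stream
    ...
  end;

  next_token : TOKEN is
    read characters, group them into a token, set its ttype to a
    value from your token vocabulary, and return it; return a token
    of type ANTLR_COMMON_TOKEN::EOF_TYPE at end of input
    ...
  end;

end;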

Launching a parser with a non-ANTLR-generated lexer is the same as launching a parser with an ANTLR-generated lexer:

lex ::= #HAND_BUILT_LEXER{MY_TOKEN}(...);
p ::= #MY_PARSER{MY_TOKEN,ANTLR_COMMON_AST}(lex);
p.start-rule;

The parser does not care what kind of object you use for scanning as long as it can answer next_token.

If you build your own lexer, and the token values are also generated by that lexer, then you should inform the ANTLR-generated parsers about the token type values generated by that lexer. Use the importVocab option in the parsers that use the externally-generated token set, and create a token definition file following the requirements of the importVocab option, as shown below.
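
For example, the parser grammar might import the vocabulary as follows (a sketch; the token definition file must be somewhere ANTLR can find it):

class MY_PARSER extends Parser;
options {
  importVocab=MY;
}
... rules ...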

Lexical Rules

Lexical rules are essentially the same as parser rules except that lexical rules apply a structure to a series of characters rather than a series of tokens. As with parser rules, each lexical rule results in a method in the output lexer class.

Alternative blocks. Consider a simple series of alternatives within a block:

FORMAT : 'x' | 'f' | 'd';

The lexer would contain the following method:

mFORMAT : INT is
  if ( c = 'x' ) then
    match('x');

  elsif ( c = 'f' ) then
    match('f');

  elsif ( c = 'd' ) then
    match('d');

  else 
    raise #ANTLR_NO_VIABLE_ALT_FOR_CHAR_EXCEPTION( ... );
  end;

  return FORMAT;
end;

The only real differences between lexical methods and grammar methods are that lookahead prediction expressions do character comparisons rather than LA(i) comparisons, match matches characters instead of tokens, and a return is added to the bottom of the rule.

Simple lexical rules referenced by next_token may be inlined rather than generating a method of their own. For example, the common identifier rule would be placed directly into the next_token method. That is, rule:

ID  :   ( 'a'..'z' )+
    ;

would not result in a method in your lexer class. This rule would become part of the resulting lexer, as it would probably be inlined by ANTLR:

next_token : TOKEN is
  case ( LA(1) ) 
  cases for operators and such here
  -- chars that predict ID token
  when 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' then
    loop while!( c >= 'a' and c <= 'z' );
      match_range( 'a', 'z' );
    end;
    return make_token(ID);
  else 
    check harder stuff here like rules
      beginning with 'A'..'Z'
  end;
end;

If not inlined, the method for scanning identifiers would look like:

mID : INT is
  loop while!( c >= 'a' and c <= 'z' );
    match_range( 'a', 'z' );
  end;
  return ID;
end;

where token names are converted to method names by prefixing them with the letter m. The next_token method would become:

next_token : TOKEN is
  case ( c ) 
  cases for operators and such here
  -- chars that predict ID token
  when 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' then
    return make_token( mID );
  else
    check harder stuff here like rules
      beginning with 'A'..'Z'
  end;
end;

Note that this type of range loop is so common that it should probably be optimized to:

loop while!( c >= 'a' and c <= 'z' );
    consume;
end;

Optimization: Recursive lexical rules. Lexical rules that are directly or indirectly recursive are not inlined. For example, consider the following rule that matches nested actions:

ACTION
    :   '{' ( ACTION | ~'}' )* '}'
    ;

ACTION would result in the following (assuming a character vocabulary of 'a'..'z', '{', '}'):

mACTION : INT is
    sa_ttype : INT := ACTION;
    match('{');
    loop
        case ( LA(1) ) 
        when '{' then
            mACTION;
        when 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
             'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' then
            match_not('}');
        else
            break!;
        end;
    end;
    match('}');
    return sa_ttype;
end;

Version: $Id: //depot/code/org.antlr/release/antlr-2.7.1/doc/sa-runtime.html#1 $