OpenToken Package Readme
Version 3.0b
The OpenToken package is a facility for performing token analysis
and parsing within the Ada language. It is designed to provide all the
functionality of a traditional lexical analyzer/parser generator, such
as lex/yacc. But due to the magic of inheritance and runtime polymorphism,
it is implemented entirely in Ada as withed-in code. No precompilation
step is required, and no messy tool-generated source code is created.
Additionally, the technique of using classes of recognizers promises
to make most token specifications as simple as an easy-to-read procedure
call. The most error-prone part of generating analyzers, the token pattern
matching, has been taken out of the typical user's hands and placed into
reusable classes. Over time I hope to see the addition of enough reusable
recognizer classes that very few users will ever need to write a custom
one. Parse tokens themselves also use this technique, so they ought to
be just as reusable in principle, although there currently aren't a lot
of predefined parse tokens included in OpenToken.
Ada's type safety features should also make misbehaving analyzers and
parsers easier to debug. All this will hopefully add up to token analyzers
and parsers that are much simpler and faster to create, easier to get working
properly, and easier to understand.
History
Version 3.0b
This version contains another code reorganization to go with another new
parsing facility. This time it is recursive descent parsing. The new method
has the following advantages over table-driven parsers:
- It's simpler to implement.
- It provides many more opportunities for reuse.
- Its parsers are debuggable.
- There's no expensive parser-generation phase.
The disadvantages are:
- Its parsers are most likely a bit slower.
Given the above balance, I do intend to make this the standard supported
parsing facility for future versions of OpenToken. The "b" designation
is there to indicate that some things might not be in quite their permanent
form yet, and that there isn't yet the full set of reusable tokens to support
it that I would like to see in a release. I'm hoping for feedback both
in the form of criticisms/suggestions, and reusable tokens in order to
help finalize this facility.
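To illustrate the idea (this is a generic sketch of recursive descent
parsing, not the OpenToken API; the names and grammar here are invented
for the example): each grammar rule becomes an ordinary subprogram that
consumes input directly, so the "parser table" is just plain, steppable
code.

```ada
--  Hypothetical sketch of recursive descent parsing, NOT OpenToken code.
--  Grammar:  Expr ::= Digit { '+' Digit }
with Ada.Text_IO; use Ada.Text_IO;

procedure Descent_Demo is
   Input : constant String := "1+2+3";
   Pos   : Positive := Input'First;

   --  Each rule is a function; the call stack records the parse.
   function Parse_Digit return Integer is
      Result : constant Integer :=
        Character'Pos (Input (Pos)) - Character'Pos ('0');
   begin
      Pos := Pos + 1;
      return Result;
   end Parse_Digit;

   function Parse_Expr return Integer is
      Result : Integer := Parse_Digit;
   begin
      while Pos <= Input'Last and then Input (Pos) = '+' loop
         Pos := Pos + 1;                   --  consume '+'
         Result := Result + Parse_Digit;  --  descend into the next rule
      end loop;
      return Result;
   end Parse_Expr;

begin
   Put_Line (Integer'Image (Parse_Expr));  --  prints " 6"
end Descent_Demo;
```

Because every rule is an ordinary subprogram, you can set a breakpoint in
Parse_Expr and single-step the parse, which is the "debuggable" advantage
claimed above.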
A general list of the changes is below:
- Renamed the OpenToken.Token tree to OpenToken.Token.Enumerated.
- Created a new (non-enumerated) base token type and base analyzer type in OpenToken.Token.
- Made a Parse routine and a Could_Parse_To routine primitives of the base token type.
- Created several predefined nonterminal tokens (both as straight types and as mixins).
- Fixed a bug in the bracketed comment recognizer.
- Implemented a (hopefully temporary) work-around for a bug in Gnat version 3.13p.
- Fixed a bug in the string recognizer where it was mishandling octal and hex escape sequences.
- Changed the analyzer and the text feeders to support analyzing binary files.
- Improved the HTML lexer to be a bit faster and more flexible.
Version 2.0
This is the first version to include parsing capability. The existing packages
underwent a major reorganization to accommodate the new functionality.
As some of the restructuring that was done is incompatible with old code,
the major revision has been bumped up to 2. A partial list of changes is
below:
- Renamed the top level of the hierarchy from Token to OpenToken.
- Moved the analyzer underneath the new OpenToken.Token hierarchy.
- Renamed the token recognizers from Token.* to OpenToken.Recognizer.*.
- Changed the text feeder procedure pointer into a text feeder object. This allows full re-entrancy in analyzers, which was previously thwarted by the global text feeders.
- Updated the SLOC counter to read the list of files to process from a file. It also handles files with errors in them a bit better.
- Added LALR(1) parsing capability and numerous packages to support it. A structure is in place to build other parsers as well.
- Created a package hierarchy to support parse tokens. The word "Token" in OpenToken now refers to objects of this type, rather than to token recognizers.
- Added an HTML lexer to the language lexers.
- OpenToken.Recognizer.Bracketed_Comment now works properly with single-character terminators.
Version 1.3.6
This version fixes a rare bug in the Ada-style-based numeric recognizers.
The SLOC counter can now successfully count all the source files in Gnat's
adainclude directory.
Version 1.3.5
This version adds a simple Ada SLOC counting program into the examples.
A bug in the Real token recognizer that caused Constraint_Errors has
been fixed. Bugs causing Constraint_Errors in the Ada-style-based
integer and real recognizers on long non-based numbers have also been fixed.
Version 1.3
This version adds the default token capability to the Analyzer package.
This allows a more flexible (if somewhat inefficient) means of error handling
to the analyzer. The default token can be used as an error token, or it
can be made into a non-reportable token to ignore unknown elements entirely.
Identifier tokens were generalized a bit to allow user-defined character
sets for the first and subsequent characters. This not only gives them the
ability to handle syntaxes that don't exactly match Ada's, but also allows
one to define identifiers for languages that aren't Latin-1 based. Also,
the ability to turn off non-repeatable underscores was added.
Integer and Real tokens had an option added to support signed literals.
This option is set on by default (which causes a minor backward
incompatibility). Syntaxes that have addition or subtraction operators
will need to turn this option off.
A test to verify proper handling of default parameters was added to
the Test directory. A makefile was also added to the same directory to
facilitate automatic compiling and running of the tests. This makefile
will not work in a non-Gnat/NT environment without some modification.
New recognizers were added for enclosed comments (e.g. C's /* */ comments) and
single-character escape sequences. Also, a "null" recognizer was added for
use as a default token.
Version 1.2.1
This version adds the CSV field token recognizer that was inadvertently
left out of 1.2. This recognizer was designed to match fields in comma-separated
value (CSV) files, which is a somewhat standard file format for databases
and spreadsheets. Also, the extraneous CVS directories in the zip version
of the distribution were removed.
Version 1.2
The long-awaited string recognizer has been added. It is capable of recognizing
both C and Ada-style strings. In addition, there are a great many submissions
by Christoph Grein in this release. He contributed mostly complete lexical
analyzers for both Java and Ada, along with all the extra token recognizers
he needed to accomplish this feat. He didn't need as many extra recognizers
as I would have thought. But even so, slightly less than half
of the recognizers in this release were contributed by Chris (with a broken
arm, no less!).
Version 1.1
The main code change in this version is a default text feeder function
that has been added to the analyzer. It reads its input from Ada.Text_IO.Current_Input,
so you can change the file to whatever you want fairly easily. The capability
to create and use your own feeder function still exists, but it should
not be necessary in most cases. If you already have code that does this,
it should still compile and work properly.
The other addition is the first version of the OpenToken user's guide.
All it contains right now is a user manual walking through the steps needed
to make a simple token analyzer. Feedback and/or ideas on this are welcome.
Version 1.0
This is the very first publicly released version. This package is based
on work I did while working on the JPATS trainer for FlightSafety International.
The germ of this idea came while I was trying to port a fairly ambitious,
but fatally buggy Ada 83 token recognition package written for a previous
simulator. But once I was done, I was rather surprised at the flexibility
of the final product. Seeing the possible benefit to the community, and
to the company through user-submitted enhancement and debugging, I suggested
that this code be released as Open Source. They were open-minded enough
to agree. Bravo!
Future
As it stands, I am developing and maintaining this package as part of my
master's thesis. Thus you can count on a certain amount of progress in
the next few months.
You may notice that most of the stuff I had marked for last release
has been delayed or thrown out. So of course plans do change. :-) But with
that caveat...
Things on my plate for the next release:
- Better support for error reporting and handling
- A Reference Manual describing all the routines in all the packages
- A reference manual generator (with which the above will be created)
Things you can help with:
- More recognizers - The more of these there are, the more useful this facility is. If you make 'em, please send 'em in!
- Generally usable Tokens - I'm not sure there are as many reusable tokens out there as there are reusable recognizers, but I await pleasant surprises.
- Useful token support packages. One example would be a generic symbol table creation/lookup package.
- Well-isolated bug reports (or even fixes). Version 3.0 has quite a few changes, as did 2.0, so bugs are much more likely in this version than they have been in the past.
Again, I hope you find this package useful for your needs.
T.E.D. - dennison@telepath.com