This documentation comes with an implementation, written in Java, of the little Lisp dialect introduced in this document. This section discusses some details pertaining to that implementation, as well as some features of the language design.
On the other hand, exceptions in Java, for instance, cannot be ignored. A routine has to handle them, or declare that it passes the exception on to the calling environment. But then all the functions calling that function must handle the exception too, resulting in much more typing work.
This concept can be put to great use in, for instance, a typing system, where it can be made very hard to pass an arbitrary-type object around, like the void * of C or C++, or a GenericLispObjectHugeTypeDeclarator in a Lisp system where type declarations can be enforced. In addition, it can be made harder to pass around wrong types by forcing hand-written type conversions. An object of a generic type can be passed in, but if it costs the programmer considerable extra effort to write the type conversions in the code, the programmer might consider declaring the object with the correct type in the first place.
C++ does this the other way round; the language makes it particularly easy to do automatic type conversions between types, sometimes resulting in dangerous code.
A lot of algorithms working on data can get away with such sequential processing: performing an operation on one object, and then moving to the next object to perform the same operation on it.
The classic way in Lisp has been to have a list of objects, and have the first function return the first of the list, rest return the rest (the objects that still need processing), and when the list is empty, all objects have been processed.
The reason to make this abstraction is that for specific cases, optimizations can be made in the way the container is implemented internally. For instance, instead of a list, an array could be used, which gives constant-time access to an element in the container, but slow O(n) insertion time. Many more options are available: objects could be obtained from a compressed block of data, or read in sequentially from disk (for huge sets of objects), or a set of objects could be defined symbolically, like (range 1 Infinity). A routine could then be written that finds the first occurrence of an integer with certain properties, for instance. One would need functions first, rest and null for arrays and for objects of the form (range <from> <to>), as sketched below.
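A minimal sketch of such selectors for the symbolic (range <from> <to>) form, in the dialect of this document; the names range-null, range-first and range-rest are hypothetical, and the operators > and + are assumed to be imported from a numeric module:

; hypothetical selectors for the symbolic container (range <from> <to>)
(defun range-null (r)
  (> (second r) (third r)))                  ; empty once <from> exceeds <to>
(defun range-first (r)
  (second r))                                ; the first integer in the range
(defun range-rest (r)
  (list 'range (+ (second r) 1) (third r))) ; the remaining range

A generic routine that walks a container only through these three functions never needs to know whether the objects come from a list, an array, or a symbolic range.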
This is an important issue, because computers are good at taking a dull task and doing it millions of times over. One option is to have a huge data set and process it. An efficient way of walking over a sequence of data objects is then necessary.
Packaged environments are similar to the well-known singleton classes from C++, which are in turn similar to global functions and variables. The main disadvantage of having things global is that you can only have one instance of the global state of that module. This is not a problem for modules for which there is naturally only one instance.
Packaged modules are usually of this form. They can be seen as a name space in which certain symbols are defined to have a local meaning. Specific symbols from a module can be exported to another module, so that a symbol of one module becomes visible in another name space.
As mentioned above, modules in the Lisp dialect developed here have a very simple structure; they are essentially what is usually referred to as a class, but there will be only one instantiation of each module, and modules are not 'derived' in any form from other modules. Modules can export functions (or more precisely, modules can import functions from other modules) when appropriate, but this feature should be used with care as this easily results in re-introducing name collisions in the name space of a module.
Modules have their own global variables. When a function from a module is invoked, it can only see its own globals, even if it has been exported to another environment. This has a clear advantage: a module's globals cannot be read or modified from outside, so a function behaves the same no matter where it is called from.
When well thought out, packages can greatly aid in making a system modular, built from small sub-modules which can be added or removed.
An environment can be built up with makeenv, defining the functions and global variables contained in it:
(makeenv arithmetic
  (progn
    (defun add (n m) ...)
    (setf pi 3.1415926535)))
An expression can then be evaluated in this environment through:
(namespace <module-name> <function-name> <arguments...>)
The 'environment' can be thought of as a static class, a singleton, in object-oriented terms. There is only one of this type of object, and one instance of each property in the class, which is why the compiler can generate a static link to it. A function in an environment is no more than a method in such a singleton class.
The arithmetic operations like + are actually defined in the module INTEGER for the reference implementation of the Lisp interpreter described in this book. This means that from start up, the + operator is not directly accessible. One way to access a function in a module is to pass the name of the module as an argument to the operator namespace:
In> (+ 2 3)
ERROR: function not defined : +
Out> nil
In> (namespace 'INTEGER '+ 2 3)
Out> 5
This is obviously not the most convenient syntax. One way to encapsulate a call to a function in a module is to define a function for it:
In> (defun +(a b) (namespace 'INTEGER '+ a b))
Out> +
In> (+ 2 3)
Out> 5
This way we have back our easy access to the functionality of adding two integers. However, we are left with cumbersome syntax for importing this function. For this the function import was defined, which makes importing a function from another module into a module a little bit easier:
In> (import INTEGER + AddInt)
Out> t
In> (AddInt 2 3)
Out> 5
Note that for this language we chose the construct of namespace because we don't want access to global variables in the module. Another option would have been to allow passing an environment, or module name to eval, as in (eval <expression> <module-name>). This would have allowed the programmer to pass in any expression to the environment, effectively making it possible to evaluate variables in that module. We want to actively disallow that, by only allowing function calls to that environment. The arguments to the function call are evaluated before the function defined in the module is called, thus they are evaluated in the calling environment.
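For instance, a variable defined in the calling environment can be passed in freely, because it is evaluated before the module's function runs (assuming a session in the default name space):

In> (setf x 2)
Out> x
In> (namespace 'INTEGER '+ x 3)
Out> 5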
To define a module, makeenv can be used. makeenv takes two arguments, a module name and the body to execute to define globals and functions inside it. For example, we can define a function bar in the module foo as follows:
In> (makeenv foo (defun bar(n m) (list 'head n m)))
Out> bar
In> (bar 2 3)
ERROR: function not defined : bar
Out> nil
In> (import foo bar bar)
Out> t
In> (bar 2 3)
Out> (head 2 3 )
In> (namespace 'foo 'bar 2 3)
Out> (head 2 3 )
The function functionp allows one to check that a function is defined in a package. This is useful for checking that a package name that is passed as an argument is a valid package (has the correct functions implemented). The syntax is (functionp <namespace> <functionname>):
In> (functionp 'std '+)
Out> nil
In> (functionp 'INTEGER '+)
Out> t
In> (import INTEGER + +)
Out> t
In> (functionp 'std '+)
Out> t
The default environment (module) the application starts in, the one the user can access, is the name space std. Other name spaces include INTEGER.
Note that INTEGER is both the name of a name space and of a type. This has powerful consequences: if we define a vector to be a vector of integers, as some structure with VECTOR INTEGER in it, the code to add two vectors knows that the components of the vector can be added using the addition operation from the INTEGER module directly, resulting in speedier code than code that would accept the components as being of type OBJECT and would have to figure out for each component that it is an integer.
The system can even go as far as verifying that a parametric type can be made, by checking that certain operations are defined for it through the functionp call.
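A sketch of such a check, built only on the functionp call; the function name numeric-module-p and the set of required operations are assumptions of the sketch:

; hypothetical: check that a module implements the operations
; required of a numeric type
(defun numeric-module-p (mod)
  (and (functionp mod '+)
       (and (functionp mod 'make)
            (functionp mod 'type))))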
In the same spirit, the programming language offers an interface to operating system functionality through modules supporting those operations. The core language itself only defines the commands
{quote, if, cond, setf, while, progn, defun, defmacro, assoc, eval, first, rest, cons, atom, listp, list, equal, namespace, import, makeenv, functionp}.
All other commands (including the arithmetic commands) are accessed through other modules.
For modules representing some type, there should usually be at least a type function, which returns t if an object is of that type and nil otherwise, and a function make, which constructs such an object given some arguments (if applicable).
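As an illustration, a module for a rational-number type could provide these two functions; the module name RATIONAL and the internal representation are assumptions of this sketch:

(makeenv RATIONAL
  (progn
    ; make: construct a rational from a numerator and a denominator
    (defun make (n d) (list 'rational n d))
    ; type: t if the object is a rational, nil otherwise
    (defun type (x) (and (listp x) (equal 'rational (first x))))))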
Using exceptions here is defensible: the problem is most likely invalid input to a function, which in turn means invalid input typed in by the user, and thus an environment issue that the code cannot be expected to handle gracefully.
Exceptions halt execution immediately. The preEval catches the exception, and the user interface can continue where it left off. This is defensible behavior: the system is only expected to execute commands typed in interactively by the user, or fail. The story would be different if Yacas were used to control some system that needs to keep running after a failure, but it was not designed for that. Yacas is a computer algebra system, not the control system of a Space Shuttle.
The Lisp interpreter described so far is untyped, meaning that everything is of one type, OBJECT. What happens in fact is that the interpreter executes functions, and these functions internally do conversions: when calling an arithmetic operator like +, arguments of type OBJECT are converted to NUMBER, the addition is performed, and the result is converted back from NUMBER to OBJECT. Untyped systems are inefficient in this way.
The compiled Lisp dialect that will be developed allows for typed arguments and functions: functions can only accept arguments of a certain type, and return a result of a specific type. The typed dialect will still allow untyped (e.g. OBJECT) arguments for the lazy programmer (but as we shall see later, the lazy programmer is in for a few nasty surprises: slower code and more typing, as a lot of type conversions will need to be coded by hand). With the simple typing system we will use, no automatic type conversions are made; they all have to be done manually.
The system will consist of various packages involving operations on objects. For instance, there will be a package that performs addition on integers and real numbers, and there will be a package that deals with addition of vectors and matrices; the same holds for complex numbers, for instance. The addition operator in the global environment will then be responsible for dispatching to the appropriate addition operator in the appropriate environment. For instance, there can be an AddIntegers operator and an AddComplexNumbers operator, and the invoking addition operator can first check argument types before invoking the correct one. This is called data-directed programming.
Complications arise when two objects of different types are added. When adding an integer to a real number, one cannot use the operator for adding integers or the one for adding real numbers directly. One solution is to first convert the integer to the real number type, and then invoke the addition operator for real numbers. For numeric types this works well. Numbers can be put in what is called a tower of types: integers are a sub-type of rational numbers, which are a sub-type of real numbers, which are a sub-type of complex numbers. A number can always be moved up in the tower of types; an integer is just a special case of a complex number, for instance. So when adding an integer to a complex number, one only needs to convert the integer up to complex. This can be done in steps, so for n types only n-1 converters are needed: IntegerToRational, RationalToReal, etc.
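A sketch of the stepwise converters, with list representations analogous to the complex numbers used later in this section; all names and representations here are assumptions:

; hypothetical one-step converters: for n types, only n-1 of these are needed
(defun integer-to-rational (n) (list 'rational n 1))
(defun rational-to-real (q) (/ (second q) (third q)))  ; assumes a / operator
(defun real-to-complex (r) (list 'complex r 0))

; raise an object one step up the tower of types;
; rationalp and realp are hypothetical type predicates
(defun raise (x)
  (cond ((integerp x) (integer-to-rational x))
        ((rationalp x) (rational-to-real x))
        ((realp x) (real-to-complex x))))

Raising an integer all the way up to a complex number is then three applications of raise.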
Typing systems are not always this simple. For example, the diagonal matrices are a sub-type of the generic set of all matrices, but some diagonal matrices are also unitary matrices. So a matrix type can have more than one super-type, and super-types can have multiple sub-types too. In this case, choosing the right conversion is not trivial. The choice can be based on some heuristic: when performing a specific type of operation, first try to convert the matrix to the type for which there is the most efficient algorithm for that operation, and otherwise try another type.
It gets worse; for polynomials, x^2-y can be a multivariate polynomial in (x,y) with integer coefficients, or a univariate polynomial in x with coefficients which are univariate polynomials in y, or a univariate polynomial in y with coefficients which are univariate polynomials in x. There might be cases where the system cannot decide which type to take, and the user will have to help the system by telling it which type to take.
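Using the parametric type notation that appears later in this section (the (UNIPOLY INTEGER x) form), these competing types for x^2-y could be written as follows; the MULTIPOLY tag and the nesting convention are assumptions:

(MULTIPOLY INTEGER (x y))         ; multivariate in (x,y) over the integers
(UNIPOLY (UNIPOLY INTEGER y) x)   ; univariate in x, coefficients univariate in y
(UNIPOLY (UNIPOLY INTEGER x) y)   ; univariate in y, coefficients univariate in x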
Types can also be down-cast to sub-types, if possible. This is sometimes necessary. Suppose one has two objects of type OBJECT, as entered by the user, and the user wants to perform a division. The system can try to drop each object to the lowest possible type: first try to convert it to an integer, and if this doesn't work, see if it can be converted to a univariate polynomial, and otherwise a multivariate polynomial, etc. The trick is to have a routine that accepts univariate polynomials in internal representation, performs the operation, and returns the result in internal representation. When the operation is called with generic objects as arguments, they are first converted to the internal representation for univariate polynomials, and the result of the operation is converted back to the generic object type (a form the user could have typed in). The advantage of this is that an object is converted once, when passed to a function; consecutive function calls can then use the object in the type it is already in, without having to convert each time, and only the final result is converted back. This way types are converted only twice: at the beginning and at the end.
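A sketch of such a routine; every name in it (divide-objects, divide-unipoly, unipoly-from-object, object-from-unipoly) is hypothetical:

; hypothetical: divide two generic objects by dropping them to the
; internal univariate-polynomial representation, operating there,
; and converting the result back to a generic object
(defun divide-objects (x y)
  (object-from-unipoly
    (divide-unipoly (unipoly-from-object x)
                    (unipoly-from-object y))))

A chain of such operations would keep intermediate results in the internal representation and convert back only once, at the end.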
Interestingly, when dealing with mathematical objects there are often properties that always need to hold. In the case of make-complex, real-part and imag-part, the following should always be true for a complex number z:
(equal z (make-complex (real-part z) (imag-part z)))
Data structures are a powerful way to abstract types away. A data structure can be defined for rational numbers, matrices, and so on. cons, first and rest are also such a combination, where cons is the constructor, and first and rest are the selectors. The result of evaluating (cons a b) is an object for which first returns a, and rest returns b.
Suppose we decide to represent complex numbers as (complex real imag). Then the following routines serve as constructor and selectors:
(defun make-complex (r i)
  (list 'complex r i))
(defun real-part (z) (second z))
(defun imag-part (z) (third z))
Then this allows for the following interaction:
In> (setf z (make-complex 2 3))
Out> z
In> (real-part z)
Out> 2
In> (imag-part z)
Out> 3
We can define a predicate to check if an object is complex as:
(defun complexp (z)
  (and (listp z) (equal 'complex (first z))))
And with that a simple addition operation:
(defun add (x y)
  (cond ((integerp x) (+ x y))
        ((complexp x)
         (make-complex (add (real-part x) (real-part y))
                       (add (imag-part x) (imag-part y))))))
which allows for the interaction:
In> (add (make-complex 2 3) (make-complex 4 5))
Out> (complex 6 8 )
This is an example of data-directed programming: invoking different routines for different data types. Also note that we use the add operation for the components of the complex data object. This has powerful consequences: it means that a complex object can have components of arbitrary type, as long as an addition operation is defined for that type in the add function. The complex data object is in effect a parametric type, (complex <type> <type>), where <type> can be anything: integers, matrices, etc. <type> is a type parameter for this complex data type.
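For example, nothing in add prevents the components from being complex themselves; with the definitions above:

In> (add (make-complex (make-complex 1 2) 3) (make-complex (make-complex 4 5) 6))
Out> (complex (complex 5 7 ) 9 )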
Types in general aid in two ways: to check for correctness of code, and to give information to the compiler so it can generate more efficient code. This also holds for parametric types. Suppose we have vector objects. Pseudo-code for adding two vectors could look like this:
vector add-vectors(v1,v2)
  vector result = zero-vector
  for i from 1 to degree(v1,v2)
    result[i] = add(v1[i],v2[i])
  return result
Note we use the add operation here for adding the components of a vector. Suppose the add operation was then defined as:
object add(x,y)
  if (object-type(x) = integer) return add-integers(x,y)
  if (object-type(x) = real) return add-reals(x,y)
  if (object-type(x) = complex) return add-complex(x,y)
  if (object-type(x) = vector) return add-vectors(x,y)
Now the important thing to notice is that the add function is called from within the loop of add-vectors, and if the vectors in question contain a billion items, add gets called a billion times. However, add checks the type of the object to be added each time it is called, and it is called from the inner loop! The add-vectors function can be written with the add operation inlined into it:
vector add-vectors(v1,v2)
  vector result = zero-vector
  for i from 1 to degree(v1,v2)
    if (object-type(v1[i]) = integer) result[i] = add-integers(v1[i],v2[i])
    if (object-type(v1[i]) = real) result[i] = add-reals(v1[i],v2[i])
    if (object-type(v1[i]) = complex) result[i] = add-complex(v1[i],v2[i])
    if (object-type(v1[i]) = vector) result[i] = add-vectors(v1[i],v2[i])
  return result
However, because the types of all the vector components can be expected to be the same, this can be improved by pulling the if statements outside of the loop, so that they are performed only once instead of a billion times. This is a common way to optimize code, and it is often done by hand. There is no reason, however, why a computer could not do this optimization. When the add operation is inlined, and the compiler knows that the if statements will yield the same result every time, it can pull them outside of the loop (checking the first component stands in for checking them all), as follows:
vector add-vectors(v1,v2)
  vector result = zero-vector
  if (object-type(v1[1]) = integer)
    for i from 1 to degree(v1,v2)
      result[i] = add-integers(v1[i],v2[i])
  if (object-type(v1[1]) = real)
    for i from 1 to degree(v1,v2)
      result[i] = add-reals(v1[i],v2[i])
  if (object-type(v1[1]) = complex)
    for i from 1 to degree(v1,v2)
      result[i] = add-complex(v1[i],v2[i])
  if (object-type(v1[1]) = vector)
    for i from 1 to degree(v1,v2)
      result[i] = add-vectors(v1[i],v2[i])
  return result
If a compiler knows beforehand that two vectors are of type vector<integer>, it can devise a special function that will only work on objects of type vector<integer>, but by now this gains a lot less: the if statements have already been pulled out of the loop, so they will not be performed a billion times for an addition of two vectors with a billion components each.
Parametric types can be supported naturally. A vector could be represented as (vector <type> <components>), so that (vector SMALLFLOAT (1 0 0)) would be a base unit vector in three-dimensional space with real-valued components. Flexibility is not lost: the type can be EXPRESSION or OBJECT or some other generic type. But it can also be (UNIPOLY INTEGER x), e.g. making the vector one with univariate polynomials in x over the integers as components.
The above considerably speeds up the operation. The problem is that it is more natural to place the if statements in the add function, so that they can be used in many places: when adding complex numbers, matrices, etc. But inlining add allows the compiler to optimize the if statements away by pulling them outside of the loop. In most cases, even outside a loop, the compiler can optimize inlined code away by using typing information.
Inlining functions as described above is automatically done on macros. However, macros are slow to execute in the interpreter, so a possible feature of a compiler could be to tell it a function can be inlined. The variable scoping rules can still hold; the body should not be allowed to access variables in its calling environment.
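For example, assuming defmacro in this dialect defines what is effectively an inlined function (as the discussion above suggests), a frequently called helper could be declared as a macro; the square example and the availability of * are assumptions:

; hypothetical: declared as a macro, the body (* x x) is substituted
; at each call site instead of incurring a function call
(defmacro square (x) (* x x))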