These are some notes I (Armin) took after a talk at Chalmers by Steve Zdancewic: "Encoding Information Flow in Haskell". The talk presented a pure Haskell approach based on monad-like constructions; I think that the approach translates well to PyPy at the level of RPython.
The problem that we try to solve here is: how to give the programmer a way to write programs that are easily checked to be "secure", in the sense that bugs shouldn't allow confidential information to be unexpectedly leaked. This is not security as in defeating actively malicious attackers.
Let's suppose that we want to write a telnet-based application for a bidding system. We want normal users to be able to log in with their username and password, and place bids (i.e. type in an amount of money). The server should record the highest bid so far but not allow users to see that number. Additionally, the administrator should be able to log in with his own password and see the highest bid. The basic program:
    def mainloop():
        while True:
            username = raw_input()
            password = raw_input()
            user = authenticate(username, password)
            if user == 'guest':
                serve_guest()
            elif user == 'admin':
                serve_admin()

    def serve_guest():
        global highest_bid
        print "Enter your bid:"
        n = int(raw_input())
        if n > highest_bid:       #
            highest_bid = n       #
        print "Thank you"

    def serve_admin():
        print "Highest bid is:", highest_bid
The goal is to make this program more secure by declaring and enforcing the following properties: first, the guest code is allowed to manipulate highest_bid, as in the lines marked with #, but these lines must not leak the value of highest_bid back in a form visible to the guest user; second, the printing in serve_admin() must only be allowed if the user who logged in really is the administrator (e.g. to catch bugs like accidentally swapping the serve_guest() and serve_admin() calls in mainloop()).
The basic technique to prevent leaks is to attach "confidentiality level" tags to objects. In this example, the highest_bid int object would be tagged with label="secret", e.g. by being initialized as:
    highest_bid = tag(0, label="secret")
At first, we can think about an object space where all objects have such a label, and the label propagates to operations between objects: for example, code like highest_bid += 1 would produce a new int object with again label="secret".
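To make this rule concrete, here is a minimal pure-Python sketch of such label propagation; Tagged, as_tagged and LEVELS are invented names for illustration, not part of any actual object space:

    LEVELS = {"public": 0, "secret": 1}

    class Tagged(object):
        def __init__(self, value, label="public"):
            self.value = value
            self.label = label
        def __add__(self, other):
            # the result carries the max of the two operands' labels
            other = as_tagged(other)
            if LEVELS[other.label] > LEVELS[self.label]:
                label = other.label
            else:
                label = self.label
            return Tagged(self.value + other.value, label)

    def as_tagged(x):
        if isinstance(x, Tagged):
            return x
        return Tagged(x, "public")    # untagged constants are public

    highest_bid = Tagged(0, label="secret")
    assert (highest_bid + 1).label == "secret"   # the label propagates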
Where this approach doesn't work is with if/else or loops. In the above example, we do:
    if n > highest_bid: ...
However, by the object space rules introduced above, the result of the comparison is a "secret" bool object. This means that the guest code cannot know if it is True or False, and so the PyPy interpreter has no clue if it must follow the then or the else branch of the if. So the guest code could do highest_bid += 1, and probably even highest_bid = max(highest_bid, n) if max() is a clever enough built-in function, but clearly this approach doesn't work well for the more complicated computations that we would like to perform at this point.
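The dead end can be shown with a toy model (the class name SecretBool is invented): the moment the interpreter asks a "secret" bool for its truth value in order to pick a branch, there is nothing safe it can answer:

    class SecurityException(Exception):
        pass

    class SecretBool(object):
        # toy model of the "secret" bool returned by n > highest_bid
        def __init__(self, value):
            self._value = value
        def __nonzero__(self):
            # truth-value hook (Python 2): taking a branch would reveal
            # one bit of secret information, so it must be refused
            raise SecurityException("cannot branch on a secret bool")
        __bool__ = __nonzero__   # the same hook under Python 3

    # 'if n > highest_bid:' then effectively becomes:
    #     if SecretBool(...): ...    -> raises SecurityException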
There might be clever ways to solve this with some kind of just-in-time flow object space analysis. However, here is a possibly more practical approach. Let's forget about the object space tricks and start again. (See Related work for why the object space approach doesn't work too well.)
Suppose that the program runs on top of CPython and not necessarily PyPy. We will only need PyPy's annotator. The idea is to mark the code that manipulates highest_bid explicitly, and make it RPython in the sense that we can take its flow graph and follow the calls (we don't care about the precise types here -- we will use different annotations). Note that only the bits that manipulate the secret values need to be RPython. Example:
    # on top of CPython, 'hidden' is a type that hides a value without
    # giving any way to normal programs to access it, so the program
    # cannot do anything with 'highest_bid'
    highest_bid = hidden(0, label="secret")

    def enter_bid(n):
        if n > highest_bid.value:
            highest_bid.value = n

    enter_bid = secure(enter_bid)

    def serve_guest():
        print "Enter your bid:"
        n = int(raw_input())
        enter_bid(n)
        print "Thank you"
The point is that the expression highest_bid.value raises a SecurityException when run normally: it is not allowed to read this value. The secure() decorator uses the annotator on the enter_bid() function, with special annotations that I will describe shortly. Then secure() returns a "compiled" version of enter_bid. The compiled version is checked to satisfy the security constraints, and it contains special code that enables highest_bid.value to work.
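Here is a toy model of the run-time side only; the real secure() would in addition run the annotator over enter_bid() and refuse to compile it unless the checks described below pass (all names and the _inside_secure flag are illustrative):

    class SecurityException(Exception):
        pass

    _inside_secure = [False]   # toy flag: is compiled code running now?

    class hidden(object):
        def __init__(self, obj, label):
            self._obj = obj
            self.label = label
        def _get(self):
            if not _inside_secure[0]:
                raise SecurityException("not allowed to read this value")
            return self._obj
        def _set(self, obj):
            if not _inside_secure[0]:
                raise SecurityException("not allowed to write this value")
            self._obj = obj
        value = property(_get, _set)

    def secure(func):
        # the real secure() would analyze func's flow graph here and
        # reject it if a secret value could escape; this toy version
        # only models enabling access to hidden.value during the call
        def compiled(*args):
            _inside_secure[0] = True
            try:
                return func(*args)
            finally:
                _inside_secure[0] = False
        return compiled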
The annotations propagated by secure() are SomeSecurityLevel annotations. Normal constants are propagated as SomeSecurityLevel("public"). The highest_bid.value expression returns the annotation SomeSecurityLevel("secret"), which is the label of the constant highest_bid hidden object. We define operations between two SomeSecurityLevels to return a SomeSecurityLevel which is the max of the security levels of the operands.
The key point is that secure() checks that the return value is SomeSecurityLevel("public"). It also checks that only SomeSecurityLevel("public") values are stored e.g. in global data structures.
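A sketch of what the annotation and its rules could look like; SomeSecurityLevel is the name used above, while union() and check_result() are invented helpers standing in for the annotator's merge operation and the checks secure() performs:

    LEVELS = {"public": 0, "secret": 1}

    class SecurityException(Exception):
        pass

    class SomeSecurityLevel(object):
        # annotation attached to each variable during the analysis
        def __init__(self, level):
            self.level = level
        def union(self, other):
            # an operation between two values gets the max of the levels
            if LEVELS[other.level] > LEVELS[self.level]:
                return SomeSecurityLevel(other.level)
            return SomeSecurityLevel(self.level)

    def check_result(annotation):
        # secure() requires the return value, and anything stored into
        # global data structures, to be annotated "public"
        if annotation.level != "public":
            raise SecurityException("a secret value would escape")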
In this way, any CPython code like serve_guest() can safely call enter_bid(n). There is no way to leak information about the current highest bid back out of the compiled enter_bid().
Now there must be a controlled way to leak the highest_bid value, otherwise it is impossible even for the admin to read it. Note that serve_admin(), which prints highest_bid, is considered to "leak" this value because it performs input/output, i.e. the value escapes the program. This is a leak that we actually want -- the terminology is that serve_admin() must "declassify" the value.
To do this, there is a capability-like model that is easy to implement for us. Let us modify the main loop as follows:
    def mainloop():
        while True:
            username = raw_input()
            password = raw_input()
            user, priviledge_token = authenticate(username, password)
            if user == 'guest':
                serve_guest()
            elif user == 'admin':
                serve_admin(priviledge_token)
            del priviledge_token   # make sure nobody else uses it
The idea is that the authenticate() function (shown later) also returns a "token" object. This is a normal Python object, but it should not be possible for normal Python code to instantiate such an object manually. In this example, authenticate() returns a priviledge("public") for guests, and a priviledge("secret") for admins. Now -- and this is the insecure part of this scheme, but it is relatively easy to control -- the programmer must make sure that these priviledge_token objects don't go to unexpected places, particularly the "secret" one. They work like capabilities: having a reference to them allows parts of the program to see secret information, of a confidentiality level up to the one corresponding to the token.
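For concreteness, here is a toy version of such a token object (illustrative only: in plain Python nothing truly prevents forging one, which is exactly the part that must be kept under manual control):

    class _PriviledgeToken(object):
        # capability: holding a reference to this object is what grants
        # access to information up to confidentiality level self.level
        def __init__(self, level):
            self.level = level
        def __repr__(self):
            return "<priviledge token>"   # never display the level itself

    def priviledge(level):
        # factory; only trusted code, like authenticate(), should use it
        return _PriviledgeToken(level)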
Now we modify serve_admin() as follows:
    def serve_admin(token):
        print "Highest bid is:", declassify(highest_bid, token=token)
The declassify() function reads the value if the "token" is privileged enough, and raises an exception otherwise.
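Continuing the toy model, declassify() could be sketched as follows (hidden_obj._obj bypasses the toy access check from the hidden sketch above):

    LEVELS = {"public": 0, "secret": 1}

    class SecurityException(Exception):
        pass

    def declassify(hidden_obj, token):
        # reveal the hidden value, but only to callers holding a token
        # whose level is at least the object's confidentiality label
        if LEVELS[token.level] < LEVELS[hidden_obj.label]:
            raise SecurityException("token not privileged enough")
        return hidden_obj._obj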
What are we protecting here? The fact that we need the administrator token in order to see the highest bid. If by mistake we swap the serve_guest() and serve_admin() lines in mainloop(), then serve_admin() would be called with the guest token, and declassify() would fail. If we assume that authenticate() is not buggy, then the rest of the program is safe from leak bugs.
There are other convenient variants of declassify(). For example, in the RPython parts of the code, declassify() can be used to control more precisely at which confidentiality levels we want which values, if there are more than just two such levels. The "token" argument could also be implicit in RPython parts, meaning "use the current level"; normal non-RPython code always runs at the "public" level, but RPython functions could run with higher current levels, e.g. if they are called with a "token=..." argument.
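The implicit-token variant could be modelled with a current-level marker (hypothetical; in reality this would live inside the compiled RPython code; LEVELS, SecurityException and hidden are reused from the sketches above):

    _current_level = ["public"]   # toy: the level the running code has

    def declassify(hidden_obj, token=None):
        # variant with the token left implicit: use the current level
        if token is not None:
            level = token.level
        else:
            level = _current_level[0]
        if LEVELS[level] < LEVELS[hidden_obj.label]:
            raise SecurityException("current level too low to declassify")
        return hidden_obj._obj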
(Do not confuse this with what enter_bid() does: enter_bid() runs at the public level all along. It is ok for it to compute with, and even modify, the highest_bid.value. The point of enter_bid() was that by being an RPython function the annotator can make sure that the value, or even anything that gives a hint about the value, cannot possibly escape from the function.)
It is also useful to have "globally trusted" administrator-level RPython functions that always run at a higher level than the caller, a bit like Unix programs with the "suid" bit. If we set aside the consideration that it should not be possible to make new "suid" functions too easily, then we could define the authenticate() function of our server example as follows:
    def authenticate(username, password):
        database = {('guest', 'abc'): priviledge("public"),
                    ('admin', '123'): priviledge("secret")}
        token_obj = database[username, password]
        return username, declassify(token_obj, target_level="public")

    authenticate = secure(authenticate, suid="secret")
The "suid" argument makes the compiled function run on level "secret" even if the caller is "public" or plain CPython code. The declassify() in the function is allowed because of the current level of "secret". Note that the function returns a "public" tuple -- the username is public, and the token_obj is declassified to public. This is the property that allows CPython code to call it.
Of course, like a Unix suid program the authenticate() function could be buggy and leak information, but like suid programs it is small enough for us to feel that it is secure just by staring at the code.
An alternative to the suid approach is to play with closures, e.g.:
    def setup():
        # initialize new levels -- this cannot be used to access existing levels
        public_level = create_new_priviledge("public")
        secret_level = create_new_priviledge("secret")
        database = {('guest', 'abc'): public_level,
                    ('admin', '123'): secret_level}
        def authenticate(username, password):
            token_obj = database[username, password]
            return username, declassify(token_obj, target_level="public",
                                        token=secret_level)
        return secure(authenticate)

    authenticate = setup()
In this approach, declassify() works because it has access to the secret_level token. We still need to make authenticate() a secure() compiled function to hide the database and the secret_level more carefully; otherwise, code could accidentally find them by inspecting the traceback of the KeyError exception if the username or password is invalid. Also, secure() will check for us that authenticate() indeed returns a "public" tuple.
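The traceback leak mentioned above is real CPython behaviour, not hypothetical: the frames reachable from a traceback expose their local and closure variables, so an uncompiled authenticate() would hand secret_level to any caller that catches the KeyError:

    import sys

    try:
        authenticate('guest', 'wrong password')   # raises KeyError
    except KeyError:
        tb = sys.exc_info()[2]
        while tb.tb_next is not None:
            tb = tb.tb_next                # walk to the innermost frame
        leaked = tb.tb_frame.f_locals      # if authenticate() were a
        print leaked.get('secret_level')   # plain closure, the token
                                           # would show up right here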
This basic model is easy to extend in various directions. For example, secure() RPython functions should be allowed to return non-public results -- but then they have to be called either with an appropriate "token=..." keyword, or else they return hidden objects again. They could also be used directly from other RPython functions, in which case the level of what they return is propagated.
What I'm describing here is nothing more than an adaptation of existing techniques to RPython.
It is worth mentioning at this point why the object space approach doesn't work as well as we could first expect. The distinction between static checking and dynamic checking (with labels only attached to values) seems to be well known; it also seems to be well known that the latter is too coarse in practice. The problem is with branching and looping. From the object space's point of view it is quite hard to know what a newly computed value really depends on. Basically, it is difficult to do better than: after is_true() has been called on a secret object, we must assume that all objects created afterwards are also secret, because they could depend in some way on the truth-value of the previous secret object.
The idea of dynamically using static analysis is the key new idea presented by Steve Zdancewic in his talk. You can have small controlled RPython parts of the program that must pass a static analysis, and we only need to check dynamically that some input conditions are satisfied when other parts of the program call the RPython parts. Previous research was mostly about designing languages that are completely statically checked at compile-time. The delicate part is to get the static/dynamic mixture right so that even indirect leaks are not possible -- e.g. leaks that would occur from calling functions with strange arguments to provoke exceptions, where the presence or absence of the exception is itself information. This approach seems to do that reliably. (Of course, at the talk many people including the speaker were wondering about ways to move more of the checking to compile-time, but Python people won't have such worries :-)