The goal of PyTables is to enable the end user to easily manipulate data tables and array objects in a hierarchical structure. The foundation of the underlying hierarchical data organization is the excellent HDF5 library.
It should be noted that this package is not intended to serve as a complete wrapper for the entire HDF5 API. Rather, it aims to provide a flexible, very Pythonic tool for dealing with (arbitrarily) large amounts of data (typically larger than available memory) in tables and arrays organized in a hierarchical, persistent disk storage structure.
A table is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure, and all values in each field have the same data type. Fixed-length fields and strict data types may seem strange requirements for an interpreted language like Python, but they serve a useful purpose when the goal is to save very large quantities of data (such as that generated by data acquisition systems, Internet services or scientific applications) in an efficient manner that reduces demand on CPU time and I/O.
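The efficiency gain from fixed-length records can be illustrated with Python's standard struct module: when every record occupies the same number of bytes, records pack densely into a contiguous buffer and any record can be located by simple offset arithmetic. This is only an analogy to convey the idea, not how PyTables itself stores data on disk.

```python
import struct

# A fixed-length record layout: 16-byte name, signed 64-bit id, 32-bit float
record_fmt = struct.Struct("<16sqf")

# Pack two records back to back into one contiguous buffer
buf = (record_fmt.pack(b"proton", 1, 938.27) +
       record_fmt.pack(b"electron", 2, 0.511))

# Because every record has the same size, record i starts at
# byte offset i * record_fmt.size -- no scanning required.
name, idnumber, mass = record_fmt.unpack_from(buf, 1 * record_fmt.size)
print(name.rstrip(b"\x00"), idnumber, round(mass, 3))  # b'electron' 2 0.511
```

The constant record size is what makes random access and bulk I/O cheap; with variable-length records, finding record i would require walking all the preceding ones.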
In order to emulate in Python records that map to HDF5 C structs, PyTables implements a special metaclass that makes it easy to define a record's fields and other properties. PyTables also provides a powerful interface for mining data in tables. Records in tables are known in the HDF5 naming scheme as compound data types.
For example, you can define arbitrary tables in Python simply by declaring a class with field name and type information, as in the following example:
class Particle(IsDescription):
    name     = StringCol(16)              # 16-character string
    idnumber = Int64Col()                 # signed 64-bit integer
    ADCcount = UInt16Col()                # unsigned short integer
    TDCcount = UInt8Col()                 # unsigned byte
    grid_i   = Int32Col()                 # integer
    grid_j   = IntCol()                   # integer (equivalent to Int32Col)
    pressure = Float32Col(shape=(2,3))    # 2-D float array (single-precision)
    energy   = FloatCol(shape=(2,3,4))    # 3-D float array (double-precision)
You then pass this class to the table constructor, fill its rows with your values, and save (arbitrarily large) collections of them to a file for persistent storage. After that, the data can be retrieved and post-processed quite easily with PyTables, or even with another HDF5 application (written in C, Fortran, Java or any other language that provides a library to interface with HDF5).
Other important entities in PyTables are the array objects, which are analogous to tables except that all of their components are homogeneous. They come in different flavors: generic (a quick and fast way to deal with numerical arrays), enlargeable (arrays that can be extended along any single dimension) and variable length (each row in the array can have a different number of elements).
The next section describes the most interesting capabilities of PyTables.
PyTables takes advantage of the object orientation and introspection capabilities offered by Python, the powerful data management features of HDF5, and the flexibility and high-performance manipulation of large, grid-organized sets of objects provided by numarray, to offer these features:
The hierarchical model of the underlying HDF5 library allows PyTables to manage tables and arrays in a tree-like structure. In order to achieve this, an object tree entity is dynamically created imitating the HDF5 structure on disk. The HDF5 objects are read by walking through this object tree. You can get a good picture of what kind of data is kept in the object by examining the metadata nodes.
The different nodes in the object tree are instances of PyTables classes. There are several types of classes, but the most important ones are the Group and the Leaf classes. Group instances (referred to as groups from now on) are a grouping structure containing instances of zero or more groups or leaves, together with supplementary metadata. Leaf instances (referred to as leaves) are containers for actual data and cannot contain further groups or leaves. The Table, Array, EArray, VLArray and UnImplemented classes are descendants of Leaf, and inherit all its properties.
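The Group/Leaf distinction follows the classic composite pattern: groups hold named children, while leaves hold data and cannot have children. The following is a toy sketch of that structure using hypothetical classes, not PyTables' actual implementation:

```python
class Node(object):
    """Base class for entries in a toy object tree (illustrative only)."""
    def __init__(self, name):
        self.name = name

class Group(Node):
    """A grouping structure holding zero or more child groups or leaves."""
    def __init__(self, name):
        Node.__init__(self, name)
        self.children = {}          # child name -> Node

    def add(self, node):
        self.children[node.name] = node
        return node

class Leaf(Node):
    """A container for actual data; it cannot hold further nodes."""
    def __init__(self, name, data):
        Node.__init__(self, name)
        self.data = data

# Build a tiny tree mirroring a layout like '/group1/array2'
root = Group("/")
group1 = root.add(Group("group1"))
array2 = group1.add(Leaf("array2", [1, 2, 3, 4]))
```

Because only groups carry a children mapping, the tree terminates at leaves by construction, just as Leaf instances in PyTables cannot contain further groups or leaves.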
Working with groups and leaves is similar in many ways to working with directories and files on a Unix filesystem. As is the case with Unix directories and files, objects in the object tree are often described by giving their full (or absolute) path names. In PyTables this full path can be specified either as a string (such as '/subgroup2/table3') or as a complete object path written in a format known as the natural name schema (such as file.root.subgroup2.table3).
Support for natural naming is a key aspect of PyTables. It means that the names of instance variables of the node objects are the same as the names of the node's children. This is very Pythonic and intuitive in many cases. Check the tutorial section 3.1.6 for usage examples.
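The idea behind natural naming can be emulated in a few lines of plain Python. This is a toy sketch with a hypothetical ToyGroup class, not PyTables' implementation: if a node forwards unknown attribute lookups to its children, an attribute path mirrors the corresponding tree path.

```python
class ToyGroup(object):
    """Minimal node that exposes its children as attributes (illustrative)."""
    def __init__(self, name):
        self._name = name
        self._children = {}

    def attach(self, name, node):
        self._children[name] = node
        return node

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails:
        # resolve the name against this node's children instead.
        try:
            return self._children[name]
        except KeyError:
            raise AttributeError(name)

root = ToyGroup("/")
subgroup2 = root.attach("subgroup2", ToyGroup("subgroup2"))
table3 = subgroup2.attach("table3", ToyGroup("table3"))

# The attribute path mirrors the string path '/subgroup2/table3'
assert root.subgroup2.table3 is table3
```

This is why the two notations on the previous lines are interchangeable: the dotted attribute chain simply walks the same parent-to-child links that the slash-separated path names.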
You should also be aware that not all the data present in a file is loaded into the object tree. Only the metadata (i.e. special data that describes the structure of the actual data) is loaded. The actual data is not read until you request it (by calling a method on a particular node). Using the object tree (the metadata) you can retrieve information about the objects on disk, such as table names, titles, column names, data types in columns, numbers of rows, or, in the case of arrays, their shapes, typecodes, and so on. You can also search through the tree for specific kinds of data, then read and process it. In a certain sense, you can think of PyTables as a tool that applies the same introspection capabilities of Python objects to large amounts of data in persistent storage.
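This "metadata eagerly, data on demand" behavior can be sketched with a lazy proxy. The classes and names below are hypothetical and purely illustrative; PyTables' real machinery is considerably more involved.

```python
class LazyLeaf(object):
    """Holds cheap metadata eagerly; loads the actual data only when asked."""
    def __init__(self, name, shape, loader):
        # Metadata is available immediately, without touching the data
        self.name = name
        self.shape = shape
        self._loader = loader       # callable that performs the real read
        self._data = None

    def read(self):
        # The (potentially expensive) read happens on first request only
        if self._data is None:
            self._data = self._loader()
        return self._data

reads = []
def fake_disk_read():
    reads.append(1)                 # record that a "disk" read happened
    return [1, 2, 3, 4]

leaf = LazyLeaf("array2", (4,), fake_disk_read)
print(leaf.name, leaf.shape)        # inspecting metadata triggers no read
data = leaf.read()                  # the first data access triggers the load
leaf.read()                         # subsequent reads are served from cache
```

Deferring the read is what lets you open a file with gigabytes of data, browse its structure, and pay I/O costs only for the nodes you actually consult.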
To better understand the dynamic nature of this object tree entity, let's start with a sample PyTables script (you can find it in examples/objecttree.py) to create an HDF5 file:
from tables import *

class Particle(IsDescription):
    identity = StringCol(length=22, dflt=" ", pos=0)  # character string
    idnumber = Int16Col(1, pos=1)                     # short integer
    speed    = Float32Col(1, pos=2)                   # single-precision

# Open a file in "w"rite mode
fileh = openFile("objecttree.h5", mode="w")
# Get the HDF5 root group
root = fileh.root

# Create the groups:
group1 = fileh.createGroup(root, "group1")
group2 = fileh.createGroup(root, "group2")

# Now, create an array in the root group
array1 = fileh.createArray(root, "array1",
                           ["this is", "a string array"], "String array")

# Create 2 new tables in group1 and group2
table1 = fileh.createTable(group1, "table1", Particle)
table2 = fileh.createTable("/group2", "table2", Particle)
# Create one more array in group1
array2 = fileh.createArray("/group1", "array2", [1, 2, 3, 4])

# Now, fill the tables:
for table in (table1, table2):
    # Get the record object associated with the table:
    row = table.row
    # Fill the table with 10 records
    for i in xrange(10):
        # First, assign the values to the Particle record
        row['identity'] = 'This is particle: %2d' % (i)
        row['idnumber'] = i
        row['speed'] = i * 2.
        # This injects the record values
        row.append()
    # Flush the table buffers
    table.flush()

# Finally, close the file (this also flushes all the remaining buffers!)
fileh.close()
This small program creates a simple HDF5 file called objecttree.h5 with the structure that appears in figure 1.1. When the file is created, the metadata in the object tree is updated in memory while the actual data is saved to disk. When you close the file the object tree is no longer available. However, when you reopen this file the object tree will be reconstructed in memory from the metadata on disk, allowing you to work with it in exactly the same way as when you originally created it.
In figure 1.2 you can see an example of the object tree created when the above objecttree.h5 file is read (in fact, such an object tree is always created when reading any supported generic HDF5 file). It is worthwhile to take the time to understand it; doing so will help you avoid programming mistakes.