
Appendix C: PyTables File Format

PyTables has a powerful capability to deal with native HDF5 files created with other tools. However, there are situations where you may want to use those tools to create truly native PyTables files while retaining full compatibility with the PyTables format. That is perfectly possible, and this appendix presents the format that your own generated files should follow in order to be fully compatible with PyTables.

We are going to describe version 1.3 of the PyTables file format (introduced in PyTables version 0.9). At this stage, the file format is considered stable enough that no significant changes should be needed for a reasonable amount of time. As time goes by, some changes will be introduced (and documented here) in order to cope with new needs. However, such changes will be carefully analyzed so as to ensure backward compatibility whenever possible.

A PyTables file is composed of an arbitrarily large number of HDF5 groups (Groups in the PyTables naming scheme) and datasets (Leaves in the PyTables naming scheme). For groups, the only requirement is that they must carry certain system attributes. By convention, system attributes in PyTables are written in upper case and user attributes in lower case, but this is not enforced by the software. In the case of datasets, besides the mandatory system attributes, some conditions are further imposed on their storage layout, as well as on the datatypes used in them, as we will see shortly.

As a final remark, you can use any filter you want when creating a PyTables file, provided that the filter is a standard one in HDF5, like zlib, shuffle or szip (although the last one cannot be used from within PyTables to create a new file, datasets compressed with szip can be read, because it is the HDF5 library that does the decompression transparently).
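For instance, inside a program like the ones sketched later in this appendix, the shuffle and zlib filters could be enabled on a dataset creation property list with a C fragment such as the following (the chunk size is an arbitrary example; filters always require a chunked layout):

    hsize_t chunkdims[1] = {256};             /* example chunk size */
    hid_t plist_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist_id, 1, chunkdims);     /* filters require a chunked layout */
    H5Pset_shuffle(plist_id);                 /* shuffle filter */
    H5Pset_deflate(plist_id, 6);              /* zlib (deflate), compression level 6 */
    /* plist_id would then be passed to H5Dcreate when creating the dataset */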

C.1 Mandatory attributes for a File

The File object is, in fact, a special HDF5 group structure that is the root for the rest of the objects in the object tree. The following attributes are mandatory for the HDF5 root group structure in PyTables files:

CLASS
This attribute should always be set to 'GROUP' for group structures.
PYTABLES_FORMAT_VERSION
It represents the internal format version, and currently should be set to the '1.3' string.
TITLE
A string where the user can put a description of what this group is used for.
VERSION
Should contain the string '1.0'.
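
To make this concrete, here is a minimal sketch in C, using the core HDF5 API plus the high-level H5LT helpers, that creates a new file and writes the four mandatory attributes on its root group. The file name and title are arbitrary examples, and the way PyTables itself encodes its string attributes (scalar dataspace, null termination) is not reproduced in every detail, so treat this as a starting point rather than a reference implementation. The later sketches in this appendix extend this same program.

    #include "hdf5.h"
    #include "hdf5_hl.h"

    int main(void)
    {
        /* Create a new HDF5 file; its root group plays the role of the File */
        hid_t file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                                  H5P_DEFAULT, H5P_DEFAULT);

        /* Mandatory attributes on the root group */
        H5LTset_attribute_string(file_id, "/", "CLASS", "GROUP");
        H5LTset_attribute_string(file_id, "/", "PYTABLES_FORMAT_VERSION", "1.3");
        H5LTset_attribute_string(file_id, "/", "TITLE", "An example of a file");
        H5LTset_attribute_string(file_id, "/", "VERSION", "1.0");

        /* ... groups and leaves would be created here (see the sketches below) ... */

        H5Fclose(file_id);
        return 0;
    }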

C.2 Mandatory attributes for a Group

The following attributes are mandatory for group structures:

CLASS
This attribute should always be set to 'GROUP' for group structures.
TITLE
A string where the user can put a description of what this group is used for.
VERSION
Should contain the string '1.0'.

There exists a special Group, called the root, which, in addition to the attributes listed above, requires the following one:

PYTABLES_FORMAT_VERSION
It represents the internal format version, and currently should be set to the '1.3' string.
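
Continuing the sketch started in section C.1 (the same hypothetical file_id), a child group could be created and tagged as follows. The group name and title are arbitrary examples, and the H5Gcreate call uses the HDF5 1.6 signature (newer releases provide H5Gcreate2 and compatibility macros).

    /* Create a child group and mark it as a PyTables Group */
    hid_t group_id = H5Gcreate(file_id, "/agroup", 0);   /* HDF5 1.6 signature */

    H5LTset_attribute_string(file_id, "/agroup", "CLASS", "GROUP");
    H5LTset_attribute_string(file_id, "/agroup", "TITLE", "An example of a group");
    H5LTset_attribute_string(file_id, "/agroup", "VERSION", "1.0");

    H5Gclose(group_id);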

C.3 Mandatory attributes, storage layout and supported datatypes for Leaves

This depends on the kind of Leaf. The format for each type follows.

C.3.1 Table format

Mandatory attributes

The following attributes are mandatory for table structures:

CLASS
Must be set to 'TABLE'.
TITLE
A string where the user can put a description of what this dataset is used for.
VERSION
Should contain the string '2.2'.
FIELD_X_NAME
It contains the names of the different fields. The X means the index of the field (beware: order does matter). You should add as many attributes of this kind as there are fields in your records.
NROWS
This should contain the number of compound datatype entries in the dataset. It must be an int datatype.

Storage Layout

A Table has a dataspace with a 1-dimensional chunked layout.
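
Putting the mandatory attributes and the storage layout together, the next fragment (continuing the same sketch) creates a Table-compatible dataset. The two-field row type, the field names, the chunk size and the use of an unlimited maximum dimension (so that rows can be appended later) are illustrative choices; the rules for the row datatype itself are detailed in the next subsection.

    /* A hypothetical two-field row: an Int32 "time" and a Float64 "pressure" */
    typedef struct {
        int    time;
        double pressure;
    } row_t;

    hid_t row_type = H5Tcreate(H5T_COMPOUND, sizeof(row_t));
    H5Tinsert(row_type, "time",     HOFFSET(row_t, time),     H5T_NATIVE_INT);
    H5Tinsert(row_type, "pressure", HOFFSET(row_t, pressure), H5T_NATIVE_DOUBLE);

    /* 1-dimensional, chunked and extendible so that rows can be appended */
    hsize_t dims[1]      = {0};
    hsize_t maxdims[1]   = {H5S_UNLIMITED};
    hsize_t chunkdims[1] = {256};                 /* example chunk size */
    hid_t space_id = H5Screate_simple(1, dims, maxdims);
    hid_t plist_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist_id, 1, chunkdims);

    /* HDF5 1.6-style call; 1.8+ provides H5Dcreate2 and compatibility macros */
    hid_t dset_id = H5Dcreate(file_id, "/atable", row_type, space_id, plist_id);

    /* Mandatory attributes */
    H5LTset_attribute_string(file_id, "/atable", "CLASS", "TABLE");
    H5LTset_attribute_string(file_id, "/atable", "TITLE", "An example of a table");
    H5LTset_attribute_string(file_id, "/atable", "VERSION", "2.2");
    H5LTset_attribute_string(file_id, "/atable", "FIELD_0_NAME", "time");
    H5LTset_attribute_string(file_id, "/atable", "FIELD_1_NAME", "pressure");
    int nrows = 0;                                /* no rows written yet */
    H5LTset_attribute_int(file_id, "/atable", "NROWS", &nrows, 1);

    H5Pclose(plist_id); H5Sclose(space_id); H5Tclose(row_type); H5Dclose(dset_id);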

Datatypes supported

The datatype of the elements (rows) of a Table must be the H5T_COMPOUND compound datatype, and each of its components must be built using only the following HDF5 datatype classes:

H5T_BITFIELD
This class is used to represent the Bool type. Such a type must be built using an H5T_NATIVE_B8 datatype, followed by an HDF5 H5Tset_precision call to set its precision to just 1 bit (see the sketch at the end of this subsection).
H5T_INTEGER
This includes the following datatypes:
H5T_NATIVE_SCHAR
This represents a signed char C type, but it is effectively used to represent an Int8 type.
H5T_NATIVE_UCHAR
This represents an unsigned char C type, but it is effectively used to represent an UInt8 type.
H5T_NATIVE_SHORT
This represents a short C type, and it is effectively used to represent an Int16 type.
H5T_NATIVE_USHORT
This represents an unsigned short C type, and it is effectively used to represent an UInt16 type.
H5T_NATIVE_INT
This represents an int C type, and it is effectively used to represent an Int32 type.
H5T_NATIVE_UINT
This represents an unsigned int C type, and it is effectively used to represent an UInt32 type.
H5T_NATIVE_LONG
This represents a long C type, and it is effectively used to represent an Int32 or an Int64, depending on whether you are running a 32-bit or 64-bit architecture.
H5T_NATIVE_ULONG
This represents an unsigned long C type, and it is effectively used to represent an UInt32 or an UInt64, depending on whether you are running a 32-bit or 64-bit architecture.
H5T_NATIVE_LLONG
This represents a long long C type (__int64, if you are using a Windows system) and it is effectively used to represent an Int64 type.
H5T_NATIVE_ULLONG
This represents an unsigned long long C type (beware: this type does not have a correspondence on Windows systems) and it is effectively used to represent an UInt64 type.
H5T_FLOAT
This includes the following datatypes:
H5T_NATIVE_FLOAT
This represents a float C type and it is effectively used to represent a Float32 type.
H5T_NATIVE_DOUBLE
This represents a double C type and it is effectively used to represent a Float64 type.
H5T_STRING
The datatype used to describe strings in PyTables is H5T_C_S1 (i.e. a string C type) followed by a call to the HDF5 H5Tset_size() function to set its length.
H5T_ARRAY
This allows the construction of homogeneous, multi-dimensional arrays, so that you can include such objects in compound records. The types supported as elements of H5T_ARRAY datatypes are the ones described above. Currently, PyTables does not support nested H5T_ARRAY types.
H5T_COMPOUND
This allows the support of complex numbers. Its format is described below:

The H5T_COMPOUND type class contains two members. Both members must have the H5T_FLOAT atomic datatype class. The name of the first member should be "r" and represents the real part. The name of the second member should be "i" and represents the imaginary part. The precision property of both of the H5T_FLOAT members must be either 32 significant bits (e.g. H5T_NATIVE_FLOAT) or 64 significant bits (e.g. H5T_NATIVE_DOUBLE). They represent Complex32 and Complex64 types respectively.

Currently, PyTables does not support nested H5T_COMPOUND types; the only exception is the support for complex numbers in Table objects as described above.
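
As a hedged illustration of the special member types just described, the following C fragment builds a Bool bitfield type, a Complex64 compound and a fixed-length string type; the resulting handles could then be inserted into a row compound type with H5Tinsert, as in the Table sketch above. The struct layout and string length are arbitrary examples.

    /* Bool: an H5T_NATIVE_B8 bitfield restricted to a precision of 1 bit */
    hid_t bool_type = H5Tcopy(H5T_NATIVE_B8);
    H5Tset_precision(bool_type, 1);

    /* Complex64: two H5T_NATIVE_DOUBLE members named "r" (real) and "i" (imaginary) */
    typedef struct { double r; double i; } complex64_t;
    hid_t complex_type = H5Tcreate(H5T_COMPOUND, sizeof(complex64_t));
    H5Tinsert(complex_type, "r", HOFFSET(complex64_t, r), H5T_NATIVE_DOUBLE);
    H5Tinsert(complex_type, "i", HOFFSET(complex64_t, i), H5T_NATIVE_DOUBLE);

    /* A fixed-length string of 16 characters (H5T_STRING class) */
    hid_t str_type = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_type, 16);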

C.3.2 Array format

Mandatory attributes

The following attributes are mandatory for array structures:

CLASS
Must be set to 'ARRAY'.
FLAVOR
This is meant to provide information about the kind of object kept in the Array, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take one of the following string values:
"NumArray"
The dataset will be returned as a NumArray object (from the numarray package).
"CharArray"
The dataset will be returned as a CharArray object (from the numarray package).
"Numeric"
The dataset will be returned as an array object (from the Numeric package).
"List"
The dataset will be returned as a Python List object.
"Tuple"
The dataset will be returned as a Python Tuple object.
"Int"
The dataset will be returned as a Python Int object. This is meant mainly for scalar (i.e. without dimensions) integer values.
"Float"
The dataset will be returned as a Python Float object. This is meant mainly for scalar (i.e. without dimensions) floating point values.
"String"
The dataset will be returned as a Python String object. This is meant mainly for scalar (i.e. without dimensions) string values.
TITLE
A string where the user can put a description of what this dataset is used for.
VERSION
Should contain the string '2.1'.

Storage Layout

An Array has a dataspace with an N-dimensional contiguous layout (if you prefer a chunked layout, see EArray below).

Datatypes supported

The elements of an Array must have either HDF5 atomic datatypes or a compound datatype representing a complex number. The atomic datatypes can currently be one of the following HDF5 datatype classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. See the Table format description in section C.3.1 for more info about these types.

In addition to the HDF5 atomic datatypes, the Array format supports complex numbers with the H5T_COMPOUND datatype class. See the Table format description in section C.3.1 for more info about this special type.

You should note that H5T_ARRAY class datatypes are not allowed in Array objects.
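
A hedged sketch of an Array-compatible dataset follows (continuing the same program). The H5LTmake_dataset_double helper writes a contiguous dataset in one call; the shape, data, flavor and title are arbitrary examples.

    /* A 2x3 contiguous dataset of doubles (contiguous is the HDF5 default layout) */
    hsize_t adims[2] = {2, 3};
    double  adata[6] = {0, 1, 2, 3, 4, 5};
    H5LTmake_dataset_double(file_id, "/anarray", 2, adims, adata);

    /* Mandatory attributes */
    H5LTset_attribute_string(file_id, "/anarray", "CLASS", "ARRAY");
    H5LTset_attribute_string(file_id, "/anarray", "FLAVOR", "NumArray");
    H5LTset_attribute_string(file_id, "/anarray", "TITLE", "An example of an array");
    H5LTset_attribute_string(file_id, "/anarray", "VERSION", "2.1");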

C.3.3 EArray format

Mandatory attributes

The following attributes are mandatory for earray structures:

CLASS
Must be set to 'EARRAY'.
EXTDIM
(Integer) Must be set to the extensible dimension. Only one extensible dimension is supported right now.
FLAVOR
This is meant to provide information about the kind of objects kept in the EArray, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take the same values as for the Array object (see C.3.2), except "Int" and "Float".
TITLE
A string where the user can put a description of what this dataset is used for.
VERSION
Should contain the string '1.1'.

Storage Layout

An EArray has a dataspace with an N-dimensional chunked layout.

Datatypes supported

The elements of an EArray are allowed to have the same datatypes as the elements of the Array format. They can be one of the HDF5 atomic datatype classes H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT or H5T_STRING, or an H5T_COMPOUND datatype representing a complex number; see the Table format description in section C.3.1 for more info about these types.

You should note that H5T_ARRAY class datatypes are not allowed in EArray objects.
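
The following fragment (continuing the same sketch) creates an EArray-compatible dataset whose first dimension is extensible; the element type, chunk shape, flavor and title are arbitrary examples. Rows would later be appended by enlarging the dataset along that dimension (H5Dextend in HDF5 1.6, H5Dset_extent in later releases) and writing to the new hyperslab; the EXTDIM attribute tells PyTables which dimension that is.

    /* A 2-D float dataset whose first dimension (EXTDIM = 0) can grow */
    hsize_t edims[2]      = {0, 3};
    hsize_t emaxdims[2]   = {H5S_UNLIMITED, 3};
    hsize_t echunkdims[2] = {16, 3};              /* example chunk shape */

    hid_t espace_id = H5Screate_simple(2, edims, emaxdims);
    hid_t eplist_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(eplist_id, 2, echunkdims);

    /* HDF5 1.6-style call; 1.8+ provides H5Dcreate2 and compatibility macros */
    hid_t edset_id = H5Dcreate(file_id, "/anearray", H5T_NATIVE_FLOAT,
                               espace_id, eplist_id);

    /* Mandatory attributes */
    int extdim = 0;
    H5LTset_attribute_int(file_id, "/anearray", "EXTDIM", &extdim, 1);
    H5LTset_attribute_string(file_id, "/anearray", "CLASS", "EARRAY");
    H5LTset_attribute_string(file_id, "/anearray", "FLAVOR", "NumArray");
    H5LTset_attribute_string(file_id, "/anearray", "TITLE", "An example of an earray");
    H5LTset_attribute_string(file_id, "/anearray", "VERSION", "1.1");

    H5Pclose(eplist_id); H5Sclose(espace_id); H5Dclose(edset_id);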

C.3.4 VLArray format

Mandatory attributes

The following attributes are mandatory for vlarray structures:

CLASS
Must be set to 'VLARRAY'.
FLAVOR
This is meant to provide information about the kind of objects kept in the VLArray, i.e. when the dataset is read, it will be converted to the indicated flavor. It can take one of the following values:
"NumArray"
The elements in the dataset will be returned as NumArray objects (from the numarray package).
"CharArray"
The elements in the dataset will be returned as CharArray objects (from the numarray package).
"String"
The elements in the dataset will be returned as Python String objects of fixed length (and not as CharArrays).
"Numeric"
The elements in the dataset will be returned as array objects (from the Numeric package).
"List"
The elements in the dataset will be returned as Python List objects.
"Tuple"
The elements in the dataset will be returned as Python Tuple objects.
"Object"
The elements in the dataset will be interpreted as pickled objects (i.e. objects serialized with the Python pickle module) and returned as generic Python objects. Only one such object is supported per entry. As the pickle module is not normally available in other languages, this flavor will not be generally useful.
"VLString"
The elements in the dataset will be returned as Python String objects of any length, with the twist that Unicode strings are supported as well (provided you use the UTF-8 encoding, see below). However, only one such object is supported per entry.
TITLE
A string where the user can put a description of what this dataset is used for.
VERSION
Should contain the string '1.1'.

Storage Layout

A VLArray has a dataspace with a 1-dimensional chunked layout.

Datatypes supported

The datatype of the elements (rows) of VLArray objects must be the H5T_VLEN variable-length (or VL for short) datatype, and the base datatype specified for the VL datatype can be any atomic HDF5 datatype listed in the Table format description, section C.3.1. That includes the classes:

  • H5T_BITFIELD
  • H5T_INTEGER
  • H5T_FLOAT
  • H5T_STRING
  • H5T_ARRAY

The base datatype can also be an H5T_COMPOUND datatype representing a complex number; see the Table format description in section C.3.1 for a detailed description.

You should note that this does not include another VL datatype, or a compound datatype that does not fit the description of a complex number. Note as well that, for the Object and VLString special flavors, the base for the VL datatype is always an H5T_NATIVE_UCHAR. That means that the complete row entry in the dataset has to be used in order to fully serialize the object or the variable-length string.

In addition, if you plan to use the VLString flavor for your text data and you are using the ascii-7 (7-bit ASCII) encoding for your strings, but you don't know how (or just don't want) to convert them to the required UTF-8 encoding, you should not worry too much, because ASCII characters with values in the range [0x00, 0x7f] are directly mapped to Unicode characters in the range [U+0000, U+007F], and the UTF-8 encoding has the useful property that a UTF-8 encoded ascii-7 string is indistinguishable from a traditional ascii-7 string. So, you will not need any further conversion in order to save your ascii-7 strings with a VLString flavor.
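
Finally, a hedged sketch of a VLArray-compatible dataset (continuing the same program): a 1-dimensional chunked dataset whose element type is a variable-length sequence of 32-bit integers. The base type, chunk size, flavor and title are arbitrary examples.

    /* Variable-length rows of 32-bit integers */
    hid_t vl_type = H5Tvlen_create(H5T_NATIVE_INT);

    hsize_t vdims[1]      = {0};
    hsize_t vmaxdims[1]   = {H5S_UNLIMITED};
    hsize_t vchunkdims[1] = {64};                 /* example chunk size */

    hid_t vspace_id = H5Screate_simple(1, vdims, vmaxdims);
    hid_t vplist_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(vplist_id, 1, vchunkdims);

    /* HDF5 1.6-style call; 1.8+ provides H5Dcreate2 and compatibility macros */
    hid_t vdset_id = H5Dcreate(file_id, "/avlarray", vl_type, vspace_id, vplist_id);

    /* Mandatory attributes */
    H5LTset_attribute_string(file_id, "/avlarray", "CLASS", "VLARRAY");
    H5LTset_attribute_string(file_id, "/avlarray", "FLAVOR", "NumArray");
    H5LTset_attribute_string(file_id, "/avlarray", "TITLE", "An example of a vlarray");
    H5LTset_attribute_string(file_id, "/avlarray", "VERSION", "1.1");

    H5Pclose(vplist_id); H5Sclose(vspace_id); H5Tclose(vl_type); H5Dclose(vdset_id);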

