In this chapter, you will get a deeper knowledge of PyTables internals. PyTables has several places where the user can improve the performance of their application. If you are planning to deal with really large data, you should read this section carefully in order to learn how to get an important boost for your code. But if your dataset is of small or medium size (say, up to 10 MB), you should not worry about that, as the default parameters in PyTables are already tuned to handle it perfectly.
The underlying HDF5 library used by PyTables takes the data in bunches of a certain length, the so-called chunks, and writes them on disk as a whole, i.e. the HDF5 library treats chunks as atomic objects and disk I/O is always done in terms of complete chunks. This allows data filters to be defined by the application to perform tasks such as compression, encryption, checksumming, etc. on entire chunks.
An in-memory B-tree is used to map chunk structures on disk. The more chunks that are allocated for a dataset the larger the B-tree. Large B-trees take memory and cause file storage overhead as well as more disk I/O and higher contention for the metadata cache. Consequently, it's important to balance between memory and I/O overhead (small B-trees) and time to access data (big B-trees).
PyTables can determine an optimum chunk size that keeps the B-tree adequate for your dataset size if you help it by providing an estimate of the number of rows in a table. This must be done at table creation time by passing this value in the expectedrows keyword of the createTable method (see 4.2.2).
When your table size is bigger than 10 MB (take this figure only as a reference, not strictly), providing this guess of the number of rows will optimize the access to your data. When the table size is larger than, say, 100 MB, you are strongly advised to provide such a guess; failing to do so may cause your application to perform very slow I/O operations and to demand huge amounts of memory. You have been warned!
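For instance, a minimal sketch of how this could look is shown below; the file name, node name and record layout are made up for illustration, and the example follows the openFile/createTable API used throughout this manual.

```python
from tables import openFile, IsDescription, IntCol, FloatCol

class Particle(IsDescription):      # hypothetical record layout
    id = IntCol()
    value = FloatCol()

fileh = openFile("big_table.h5", mode="w")
# Passing expectedrows lets PyTables choose chunk and buffer sizes suited
# to a table of roughly ten million rows.
table = fileh.createTable(fileh.root, 'particles', Particle, "A big table",
                          expectedrows=10*1000*1000)
fileh.close()
```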
If you are issuing selections over your tables like this:

    result = [ row['var2'] for row in table if row['var1'] <= 20 ]

(for future reference, we will call this the standard selection mode) and want to improve the time taken by them, keep reading.
Then try the in-kernel selection mode through the where iterator instead:

    result = [ row['var2'] for row in table.where(table.cols.var1 <= 20) ]

This simple change of selection mode can account for an improvement in search times of up to a factor of 10 (see figure 6.1).
So, where is the trick? It's easy: in the standard selection mode the data for column var1 has to be carried up to Python space in order to evaluate the condition and decide whether the var2 value should be added to the result list. On the contrary, in the in-kernel mode the condition is passed to the PyTables kernel, written in C (hence the name), and evaluated there at C speed (with some help from the numarray package), so that the only values brought up to Python space are those of the rows that fulfill the condition.
You can also mix the in-kernel and standard selection modes in the same query:

    result = [ row['var2'] for row in table.where(table.cols.var3 == "foo")
               if row['var1'] <= 20 ]

Here, we have used an in-kernel selection to filter the rows whose var3 field is equal to the string "foo". Then, we apply a standard selection to complete the query.
Of course, when you mix the in-kernel and standard selection modes you should pass the most restrictive condition to the in-kernel part, i.e. to the where iterator. In situations where it is not clear which is the most restrictive condition, you might want to experiment a bit in order to find the best combination.
When you need more speed than in-kernel selections can offer you, PyTables offers a third selection method, the so-called indexed mode. In this mode, you have to decide which column(s) you are going to use for your selections and index them. Indexing is just a kind of sorting operation, so that subsequent searches along a column can look at the sorted information using a binary search, which is much faster than a sequential search.
You can index your selected columns in several ways:
One way is to declare the columns you want to index with the indexed parameter set in the table description:

    class Example(IsDescription):
        var1 = StringCol(length=4, dflt="", pos=1, indexed=1)
        var2 = BoolCol(0, indexed=1, pos=2)
        var3 = IntCol(0, indexed=1, pos=3)
        var4 = FloatCol(0, indexed=0, pos=4)

In this case, we are telling PyTables that the var1, var2 and var3 columns will be indexed automatically when you add rows to a table with this description.
Another way is to call the createIndex() method of already existing, unindexed columns:

    indexrows = table.cols.var1.createIndex()
    indexrows = table.cols.var2.createIndex()
    indexrows = table.cols.var3.createIndex()

This will create indexes for the var1, var2 and var3 columns and, after doing that, they will behave as regular indexes.
After a column has been indexed, an indexed selection looks just like an in-kernel one:

    result = [ row['var2'] for row in table.where(table.cols.var1 == "foo") ]

or, if you want to add more conditions, you can mix the indexed selection with a standard one:
    result = [ row['var2'] for row in table.where(table.cols.var3 <= 20)
               if row['var1'] == "foo" ]

Remember to pass the most restrictive condition to the where iterator.
You can see in figures 6.1 and 6.2 that indexing can accelerate your data selections in tables quite a lot. For moderately large tables (more than one million rows), you can achieve speed-ups on the order of 100x with respect to in-kernel selections and on the order of 1000x with respect to standard selections.
One important aspect of indexing in PyTables is that it has been implemented with the goal of being capable of managing very large tables effectively. In figure 6.3, you can see that the time to index columns in tables always grows linearly. In particular, the time to index a couple of columns with one billion rows each is 40 min. (roughly 20 min. each), which is a quite reasonable figure. This is because PyTables has chosen an algorithm that does a partial sort of the columns in order to ensure that the indexing time grows linearly. On the contrary, most relational databases try to do a complete sort of the columns, which makes the indexing time grow quadratically with the number of rows.
The fact that relational databases use a complete sorting algorithm for indexes means that their indexes would be more effective for searching purposes (although not by a large margin) than the PyTables approach. However, for relatively large tables (more than 10 million rows) the time required to complete such a sort can be so long that indexing is normally not worth the effort. In other words, PyTables indexing scales much better than that of relational databases. So, don't worry if you have extremely large columns to index: PyTables is designed to handle that perfectly.
One of the beauties of PyTables is that it supports compression on tables and arrays, although it is disabled by default. Compressing large amounts of data might be a somewhat controversial feature, because compression has a reputation of being a very heavy consumer of CPU time. However, if you are willing to check whether compression can help not only to reduce your dataset file size but also to improve I/O efficiency, keep reading.
There is a common scenario where users need to save duplicated data in some record fields, while the other fields have varying values. In a relational database approach such redundant data can normally be moved to other tables and a relationship between the rows of the separate tables can be created. But that takes analysis and implementation time, and makes the underlying libraries more complex and slower.
PyTables transparent compression allows users not to worry about finding the optimum layout for their data tables, but rather to use fewer, not directly related, tables with a larger number of columns while still not cluttering the database too much with duplicated data (compression takes care of avoiding that). As a side effect, data selections can be made more easily because you have more fields available in a single table, and they can be referred to in the same loop. This normally results in a simpler, yet powerful, way to process your data (although you should still be careful about the kind of scenarios where the use of compression is convenient or not).
The compression library used by default is Zlib (see ), and as HDF5 requires it, you can safely use it and expect that your HDF5 files will be readable on any other platform that has the HDF5 libraries installed. Zlib provides a good compression ratio, although somewhat slowly, and reasonably fast decompression. Because of that, it is a good candidate for compressing your data.
However, in many situations (e.g. write once, read many), it is critical to have very good decompression speed (at the expense of either a lower compression ratio or more CPU spent on compression, as we will see soon). This is why support for two additional compressors has been added to PyTables: LZO and UCL (see ). According to their author (and checked by the author of this manual), LZO offers pretty fast compression (although with a smaller compression ratio) and extremely fast decompression, while UCL achieves an excellent compression ratio (at the price of spending much more CPU time) while still allowing very fast decompression (very close to that of LZO). In fact, LZO and UCL are so fast when decompressing that, in general (it depends on your data, of course), writing and reading a compressed table is actually faster (and sometimes much faster) than if it were uncompressed. This fact is very important, especially if you have to deal with very large amounts of data.
Be aware that the LZO and UCL support in PyTables is not standard in HDF5, so if you are going to use your PyTables files in contexts other than PyTables you will not be able to read them. Still, see appendix B.2, where the ptrepack utility is described, for a way to free your files from LZO or UCL dependencies, so that you can use these compressors locally with the guarantee that you can replace them with Zlib (or even remove compression completely) if you want to export the files to other HDF5 tools afterwards.
In order to give you a rough idea of what compression ratios can be achieved, and what resources are consumed, look at table 6.1. This table has been obtained from synthetic data and with a somewhat outdated PyTables version (0.5), so take it just as a guide, because your mileage will probably vary. Also have a look at graphs 6.4 and 6.5 (these graphs have been obtained with tables of a different row size and a different PyTables version than the previous example, so do not try to directly compare the figures). They show how the speed of writing/reading rows evolves as the size (the number of rows) of the tables grows. Even though in these graphs the size of a single row is 56 bytes, you can most probably extrapolate these figures to other row sizes. If you are curious about how well compression can perform together with Psyco, look at graphs 6.6 and 6.7. As you can see, the results are pretty interesting.
Compr. Lib | File size (MB) | Time writing (s) | Time reading (s) | Speed writing (Krow/s) | Speed reading (Krow/s) |
---|---|---|---|---|---|
NO COMPR | 244.0 | 24.4 | 16.0 | 18.0 | 27.8 |
Zlib (lvl 1) | 8.5 | 17.0 | 3.11 | 26.5 | 144.4 |
Zlib (lvl 6) | 7.1 | 20.1 | 3.10 | 22.4 | 144.9 |
Zlib (lvl 9) | 7.2 | 42.5 | 3.10 | 10.6 | 145.1 |
LZO (lvl 1) | 9.7 | 14.6 | 1.95 | 30.6 | 230.5 |
UCL (lvl 1) | 6.9 | 38.3 | 2.58 | 11.7 | 185.4 |
Looking at the graphs, you can expect that, generally speaking, LZO will be the fastest at both compressing and decompressing, but the one that achieves the worst compression ratio (although that may be just fine for many situations). UCL is the slowest when compressing, but it is faster than Zlib when decompressing and, besides, it achieves very good compression ratios (generally better than Zlib). Zlib represents a balance between them: it is somewhat slow at compressing and the slowest at decompressing, but it normally achieves fairly good compression ratios.
So, if your ultimate goal is reading as fast as possible, choose LZO. If you want to reduce your data as much as possible, while retaining a good read speed, choose UCL. If you don't care too much about the above parameters and/or portability is important to you, Zlib is your best bet.
The compression level that I recommend for all compression libraries is 1. This is the lowest level of compression, but if you take the approach suggested above, the redundant data will normally be found within the same row, so data locality is very high and such a low compression level should be enough to achieve a good compression ratio on your data tables, saving CPU cycles for other things. Nonetheless, in some situations you may want to check how the compression level affects your application.
You can select the compression library and level by setting the complib and compress keywords of the Filters class (see 4.13.1). A compression level of 0 completely disables compression (the default), 1 is the least CPU-demanding level, while 9 is the maximum level and the most CPU intensive. Finally, keep in mind that LZO does not accept a compression level right now, so, when using LZO, 0 means that compression is not active, and any other value means that LZO is active.
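As an illustration, here is a hedged sketch of creating a compressed table. It assumes that the Filters constructor accepts complevel and complib keyword arguments (as in recent PyTables releases; the level keyword may have a different name in the version described here), and the file and node names are made up.

```python
from tables import openFile, Filters, IsDescription, IntCol, FloatCol

class Record(IsDescription):        # hypothetical record layout
    id = IntCol()
    value = FloatCol()

# Zlib at level 1: the least CPU-demanding setting, normally enough when
# the redundancy lives inside each row (see the recommendation above).
zlib1 = Filters(complevel=1, complib="zlib")

fileh = openFile("compressed.h5", mode="w")
table = fileh.createTable(fileh.root, 'data', Record, "A compressed table",
                          filters=zlib1)
fileh.close()
```

Setting the compression level to 0 in the sketch above would disable compression altogether.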
The HDF5 library provides an interesting filter that can leverage the results of your favorite compressor. Its name is shuffle, and because it can greatly benefit compression and it doesn't take many CPU resources, it is active by default in PyTables whenever compression is activated (independently of the chosen compressor). It is of course deactivated when compression is off (which is the default, as you already should know).
From the HDF5 reference manual:
The shuffle filter de-interlaces a block of data by reordering the bytes. All the bytes from one consistent byte position of each data element are placed together in one block; all bytes from a second consistent byte position of each data element are placed together a second block; etc. For example, given three data elements of a 4-byte datatype stored as 012301230123, shuffling will re-order data as 000111222333. This can be a valuable step in an effective compression algorithm because the bytes in each byte position are often closely related to each other and putting them together can increase the compression ratio.
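The following purely illustrative sketch (not the actual HDF5 implementation) reproduces the byte reordering described in the quote above:

```python
# Three 4-byte data elements, stored contiguously as "012301230123".
elements = ["0123", "0123", "0123"]

# Group together the bytes that share the same position inside every element:
# first all bytes at position 0, then all bytes at position 1, and so on.
shuffled = "".join(element[pos] for pos in range(4) for element in elements)

assert shuffled == "000111222333"
```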
In table 6.2 you can see a benchmark that shows how the shuffle filter can help the different libraries to compress data in three table datasets. Generally speaking, shuffle makes the writing process (shuffling + compressing) faster (between 7% and 22%), which is an interesting result in itself. However, the reading process (unshuffling + decompressing) becomes slower, although to a lesser extent (between 3% and 18%).
But the most remarkable fact is the level of compression that the compressor filters can achieve after shuffle has passed over the data: the total file size can be up to 40 times smaller than the uncompressed file, and up to 5 times smaller than the already compressed files (!). Of course, the data used for this test is synthetic, and shuffle seems to do a great job with it, so in general the results will vary in your case. However, given its small drawbacks (reads are slowed down to a small extent) and its potential gains (faster writing, but especially a much better compression level), I do believe that it is a good thing to have such a filter enabled by default in the battle for discovering redundancy in your data.
Compr. Lib | File size (MB) | Time writing (s) | Time reading (s) | Speed writing (MB/s) | Speed reading (MB/s) |
---|---|---|---|---|---|
NO COMPR | 165.4 | 24.5 | 17.13 | 6.6 | 9.6 |
Zlib (lvl 1) | 26.4 | 22.2 | 5.77 | 7.3 | 28.4 |
Zlib+shuffle | 4.0 | 19.0 | 5.94 | 8.6 | 27.6 |
LZO (lvl 1) | 44.9 | 17.8 | 4.13 | 9.2 | 39.7 |
LZO+shuffle | 4.3 | 16.4 | 5.03 | 9.9 | 32.6 |
UCL (lvl 1) | 27.4 | 48.8 | 5.02 | 3.3 | 32.7 |
UCL+shuffle | 3.5 | 38.1 | 5.31 | 4.3 | 30.9 |
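If you want to reproduce this kind of comparison on your own data, a hedged sketch of explicitly enabling or disabling the shuffle filter could look like the following; it assumes that the Filters constructor takes a shuffle argument, as in recent PyTables releases.

```python
from tables import Filters

# Zlib level 1 with shuffle, the default whenever compression is active.
zlib_shuffled = Filters(complevel=1, complib="zlib", shuffle=1)

# The same compressor with shuffle explicitly turned off, for comparison.
zlib_plain = Filters(complevel=1, complib="zlib", shuffle=0)
```

Passing either filters instance when creating your tables lets you compare file sizes and read/write speeds for your particular data.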
Psyco (see ) is a kind of specialized compiler for Python that typically accelerates Python applications with no change in source code. You can think of Psyco as a kind of just-in-time (JIT) compiler, a little bit like Java's, that emits machine code on the fly instead of interpreting your Python program step by step. The result is that your unmodified Python programs run faster.
Psyco is very easy to install and use, so in most scenarios it is worth giving it a try. However, it only runs on Intel 386 architectures, so if you are using another architecture, you are out of luck (at least until Psyco supports yours).
As an example, imagine that you have a small script that reads and selects data over a series of datasets, like this:
    from tables import openFile

    def readFile(filename):
        "Select data from all the tables in filename"
        fileh = openFile(filename, mode="r")
        result = []
        for table in fileh("/", 'Table'):
            result.extend([ p['var3'] for p in table if p['var2'] <= 20 ])
        fileh.close()
        return result

    if __name__ == "__main__":
        print readFile("myfile.h5")
In order to accelerate this piece of code, you can rewrite your main program to look like:
    if __name__ == "__main__":
        import psyco
        psyco.bind(readFile)
        print readFile("myfile.h5")
That's all! From now on, each time you execute your Python script, Psyco will deploy its sophisticated algorithms to accelerate your calculations.
You can see in graphs 6.8 and 6.9 how much I/O speed improvement you can get by using Psyco. Looking at these figures, you can get an idea of whether these improvements are of interest to you or not. In general, if you are not going to use compression you will take advantage of Psyco when your tables are medium sized (from a thousand to a million rows), and this advantage will progressively disappear as the number of rows grows well over one million. However, if you use compression, you will probably see improvements even beyond this limit (see section 6.3). As always, there is no substitute for experimentation with your own dataset.
If you have a huge tree with many nodes in your data file, creating the object tree can take a long time. Often, however, you are interested in accessing only a part of the complete tree, so you won't strictly need PyTables to build the entire object tree in memory, but only the interesting part.
This is where the rootUEP parameter of the openFile function (see 4.1.2) can be helpful. Imagine that you have a file called "test.h5" with the associated tree that you can see in figure 6.10, and you are interested only in the section marked in red. You can avoid building the whole object tree by telling openFile that your root will be the /Group2/Group3 group. That is:
    fileh = openFile("test.h5", rootUEP="/Group2/Group3")
As a result, the actual object tree built will be like the one that can be seen in figure 6.11.
Of course, this has been a simple example and the use of the rootUEP parameter was not really necessary. But when you have thousands of nodes in a tree, you will certainly appreciate the rootUEP parameter.
Let's suppose that you have a file in which you have made a lot of row deletions on one or more tables, or deleted many leaves or even entire subtrees. These operations might leave holes (i.e. space that is not used anymore) in your files, which may potentially affect not only the size of the files but, more importantly, I/O performance. This is because, when you delete a lot of rows from a table, the space is not automatically recovered on the fly. In addition, if you add many more rows to a table than specified in the expectedrows keyword at creation time, this may affect performance as well, as explained in section 6.1.
In order to cope with these issues, you should be aware that the handy PyTables utility called ptrepack can be very useful, not only to compact existing files that contain such holes, but also to adjust some internal parameters (both in memory and on file) in order to create adequate buffer sizes and chunk sizes for optimum I/O speed. Please check appendix B.2 for a brief tutorial on its use.
Another thing you might want to use ptrepack for is changing the compression filters or compression levels of your existing data, for example to check how this affects both the final size and the I/O performance, or to get rid of optional compressors like LZO or UCL in your existing files in case you want to use them with generic HDF5 tools that do not support these filters.