PyMVPA includes a number of ready-to-use classifiers, which are described in the following sections. All classifiers implement the same, very simple interface. Each classifier object takes all relevant parameters as arguments to its constructor. Once instantiated, the classifier object’s train() method can be called with some dataset. This trains the classifier using all samples in the respective dataset.
The major task for a classifier is to make predictions. Predictions are made by calling the classifier’s predict() method with one or multiple data samples. predict() operates on pure sample data and not datasets, as in some cases the true label for a sample might be totally unknown.
This example demonstrates the typical daily life of a classifier.
>>> import numpy as N
>>> from mvpa.clfs.knn import kNN
>>> from mvpa.datasets import Dataset
>>> training = Dataset(samples=N.array(
... N.arange(100), ndmin=2, dtype='float').T,
... labels=[0] * 50 + [1] * 50)
>>> rand100 = N.random.rand(10)*100
>>> validation = Dataset(samples=N.array(rand100, ndmin=2, dtype='float').T,
... labels=[int(i >= 50) for i in rand100])
>>> clf = kNN(k=10)
>>> clf.train(training)
>>> N.mean(clf.predict(training.samples) == training.labels)
1.0
>>> N.mean(clf.predict(validation.samples) == validation.labels)
1.0
Two datasets with 100 and 10 samples respectively are generated. Both datasets have only one feature, and the associated label is 0 if the feature value is below 50 and 1 otherwise. The larger dataset contains all integers in the interval [0, 100) and is used to train the classifier. The smaller one is used as a validation dataset to check whether the classifier learned something that generalizes well across samples not included in the training dataset. In this case the validation dataset consists of 10 random floating-point values from the interval [0, 100).
The classifier in this example is a k-Nearest-Neighbour classifier that makes use of the 10 nearest neighbours of a data sample to make its predictions (k=10). One can see that, after training, the classifier performs optimally on the training dataset as well as on the validation data samples.
The choice of the classifier in the above example is more or less arbitrary. Any classifier in PyMVPA could be used in place of kNN. This demonstrates another useful feature of PyMVPA’s classifiers. Due to the high-level abstraction and the simple interface, almost all classifiers can be combined with most algorithms in PyMVPA. This makes it very easy to test different classifiers on some dataset (see Fig. 1).
Fig. 1: A comparison of the behavior of different classifiers (k-Nearest-Neighbour, linear SVM, logistic regression, ridge regression and SVM with radial basis function kernel) on a simple classification problem. The code to generate this figure can be found in the pylab_2d.py example.
Before looking at the different classifiers in more detail, it is important to mention another feature common to all of them. While their interface is simple, classifiers are in no way limited to reporting only predictions. All classifiers implement an additional interface: the so-called Stateful interface. Objects of any class derived from Stateful have attributes (we refer to such attributes as state variables) which are conditionally computed and stored by PyMVPA. Such conditional storage and access is handy if a variable of interest might consume a lot of memory or require intensive computation, and is not needed in most (or only in some) of the use cases.
For instance, the Classifier class defines the trained_labels state variable, which simply stores the unique labels the classifier was trained on. Since trained_labels stores meaningful information only for a trained classifier, an attempt to access clf.trained_labels before training would raise an UnknownStateError exception, since the classifier has not seen the data yet and thus does not know the labels. In other words, clf is not yet in the state to know anything about the labels, hence the name Stateful. We will refer to instances of classes derived from Stateful as 'stateful'. Any state variable can be enabled or disabled on a per-instance basis at any time during execution.
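For example, querying a freshly constructed, and hence untrained, classifier for its trained_labels triggers this exception (a sketch; the exact exception message is abbreviated with an ellipsis):
>>> untrained = kNN(k=10)
>>> untrained.trained_labels
Traceback (most recent call last):
 ...
UnknownStateError: ...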
To continue the last example, each classifier, or more precisely every stateful object, can be asked to report its existing state-related attributes:
>>> list_with_verbose_explanations = clf.states.listing
clf.states is an instance of the StateCollection class, which is a container for all state variables of the given class. Although values can be queried or set (if the state is enabled) by operating directly on the stateful object
>>> clf.trained_labels
Set([0, 1])
any other operation on the states (e.g. enabling or disabling them) has to be carried out through the StateCollection .states attribute.
>>> print clf.states
{trained_dataset predicting_time*+ training_confusion predictions*+...}
>>> clf.states.enable('values')
>>> print clf.states
{trained_dataset predicting_time*+ training_confusion predictions*+...}
>>> clf.states.disable('values')
The string representation of the state collection shown above lists all present state variables, accompanied by two markers: '+' for an enabled state variable, and '*' for a variable that stores some value (it might have been disabled in the meantime, in which case it would have no '+', and attempts to reassign it would result in no action).
By default, all classifiers provide the state variables values and predictions. The latter is simply the set of predictions returned by the last call to the object's predict() method. The former is heavily classifier-specific: by convention, the values key provides access to the raw values that a classifier's predictions are based on. Depending on the classifier, this information might require significant resources when stored. Therefore all states can be disabled or enabled (states.disable(), states.enable()) and their current status can be queried like this:
>>> clf.states.isActive('predictions')
True
>>> clf.states.isActive('values')
False
State variables can also be enabled or disabled during the construction of a stateful object by passing lists of the desired state variable names via the enable_states and/or disable_states constructor arguments. The keyword 'all' can be used to select all states known to that stateful object.
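For example, the values state variable could be enabled right at construction time (continuing the example above; a sketch relying on the constructor arguments just described):
>>> clf_v = kNN(k=10, enable_states=['values'])
>>> clf_v.states.isActive('values')
True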
The TransferError class provides a convenient way to determine the transfer error of a trained classifier on some validation dataset. A TransferError object is instantiated by passing a classifier object to the constructor. Optionally, a custom error function can be specified (see the errorfx argument).
To compute the transfer error simply call the object with a validation dataset. The computed error value is returned. TransferError also supports a state variable confusion that contains the full confusion matrix of the predictions made on the validation dataset. The confusion matrix is disabled by default.
If the TransferError object is called with an optional training dataset, the contained classifier is first trained on this dataset before predictions on the validation dataset are made.
>>> from mvpa.clfs.transerror import TransferError
>>> clf = kNN(k=10)
>>> terr = TransferError(clf)
>>> terr(validation, training)
0.0
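As noted above, TransferError supports a confusion state variable that is disabled by default; a brief sketch of enabling it to obtain the full confusion matrix of the predictions on the validation dataset:
>>> terr.states.enable('confusion')
>>> err = terr(validation, training)
>>> cm = terr.confusion  # full confusion matrix on the validation dataset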
Often one is not only interested in a single transfer error on one validation dataset, but on a cross-validated estimate of the transfer error. A popular method is the so-called leave-one-out cross-validation.
The CrossValidatedTransferError class provides a simple way to compute such a measure. It utilizes a TransferError object and a Splitter. When called with a Dataset, the splitter generates splits of the dataset, and the transfer error for each split is computed by training on one part of the split and making predictions on the other. By default the mean of the transfer errors is returned (but the actual combiner function is customizable).
The following example shows the minimal code for a leave-one-out cross-validation reusing the transfer error object from the previous example and some Dataset data.
>>> # create some dataset
>>> from mvpa.misc.data_generators import normalFeatureDataset
>>> data = normalFeatureDataset(perlabel=50, nlabels=2,
... nfeatures=20, nonbogus_features=[3, 7],
... snr=3.0)
>>> # now cross-validation
>>> from mvpa.algorithms.cvtranserror import CrossValidatedTransferError
>>> from mvpa.datasets.splitter import NFoldSplitter
>>> cvterr = CrossValidatedTransferError(terr,
... NFoldSplitter(cvtype=1))
>>> error = cvterr(data)
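Being a stateful object itself, the cross-validation object can collect additional information; a sketch, assuming CrossValidatedTransferError exposes a confusion state variable that aggregates the confusion matrices across all splits:
>>> cvterr = CrossValidatedTransferError(terr,
...                                      NFoldSplitter(cvtype=1),
...                                      enable_states=['confusion'])
>>> error = cvterr(data)
>>> cm = cvterr.confusion  # confusion aggregated over all splits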
The kNN classifier makes predictions based on the labels of nearby samples. It currently uses Euclidean distance to determine the nearest neighbours, but future enhancements may include support for other kernels.
Penalized logistic regression (PLR) is similar to ridge regression (described below) in that it has a penalty term; however, it is trained to predict a binary outcome by means of the logistic function (see the Wikipedia entry about logistic regression).
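A minimal usage sketch on the toy data from the first example (assuming PLR is importable from mvpa.clfs.plr and takes its penalty term via the lm argument):
>>> from mvpa.clfs.plr import PLR
>>> plr = PLR(lm=1.0)  # lm: penalty term (assumed argument name)
>>> plr.train(training)
>>> plr_predictions = plr.predict(validation.samples)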
Ridge regression (also known as Tikhonov regularization) is a variant of linear regression (see the Wikipedia entry about ridge regression).
The ridge regression classifier (RidgeReg) performs a simple linear regression with a penalty parameter to help avoid over-fitting. The regression inserts an intercept term, so there is no need to center the data beforehand.
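A brief sketch of its use on the toy data from the first example (assuming RidgeReg is importable from mvpa.clfs.ridge and accepts the penalty via the lm argument):
>>> from mvpa.clfs.ridge import RidgeReg
>>> ridge = RidgeReg(lm=1.0)  # lm: penalty parameter (assumed argument name)
>>> ridge.train(training)
>>> ridge_predictions = ridge.predict(validation.samples)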
Sparse multinomial logistic regression (SMLR) [2] is a fast multi-class classifier that can easily deal with high-dimensional problems (see the research paper about SMLR). PyMVPA includes two implementations: one in pure Python and a faster one that makes use of a C extension for the performance-critical pieces of the code.
[2] Krishnapuram, B., Figueiredo, M., Carin, L., & Hartemink, A. (2005). Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 957–968.
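A sketch of training such a classifier on the toy data from the first example (assuming the class is importable as mvpa.clfs.smlr.SMLR with usable defaults):
>>> from mvpa.clfs.smlr import SMLR
>>> smlr = SMLR()  # default parameterization (assumed to be sensible here)
>>> smlr.train(training)
>>> smlr_predictions = smlr.predict(validation.samples)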
Support vector machine (SVM) classifiers (and regressions) [3] are popular since they can deal with very high-dimensional problems (see the Wikipedia entry about SVM) while maintaining reasonable generalization performance.
The support vector machine classes provide a family of classifiers by wrapping the libsvm and Shogun libraries, with the corresponding base classes libsvm.SVM and sg.SVM respectively. By default, the SVM class is bound to the libsvm implementation if it is available (and to the Shogun one otherwise).
While any SVM class provides a complete interface, a number of child classes make it easy to run standard classifiers, such as a linear SVM, with a default set of parameters (see LinearCSVMC, LinearNuSVMC, RbfNuSVMC and RbfCSVMC).
[3] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
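For example, a linear C-SVM with default parameters can be used just like any other classifier shown above (a sketch, assuming LinearCSVMC is importable from mvpa.clfs.svm):
>>> from mvpa.clfs.svm import LinearCSVMC
>>> svm = LinearCSVMC()  # linear C-SVM with default parameters
>>> svm.train(training)
>>> svm_predictions = svm.predict(validation.samples)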
To facilitate easy testing of different classifiers on any specific task, the warehouse of classifiers clfs.warehouse.clfs provides a sample collection of some commonly used parameterizations of the classifiers present in PyMVPA. This collection can be queried with any set of known keywords/tags, where tags prefixed with ! are excluded:
>>> from mvpa.clfs.warehouse import clfs
>>> print len(clfs['multiclass', '!svm'])
8
This example selects all classifiers that are capable of multiclass classification and are not SVM-based.