
mvpa.datasets.base

Dataset container

The comprehensive API documentation for this module, including all technical details, is available in the Epydoc-generated API reference for mvpa.datasets.base (for developers).

Dataset

class mvpa.datasets.base.Dataset(data=None, dsattr=None, dtype=None, samples=None, labels=None, labels_map=None, chunks=None, origids=None, check_data=True, copy_samples=False, copy_data=True, copy_dsattr=True)

Bases: object

The Dataset.

This class provides a container to store all necessary data to perform MVPA analyses. These are the data samples, as well as the labels associated with the samples. Additionally, samples can be grouped into chunks.

Groups:
  • Creators: __init__, selectFeatures, selectSamples, applyMapper
  • Mutators: permuteLabels

Important: labels are assumed to be immutable, i.e. no one should modify them externally by accessing indexed items; something like dataset.labels[1] += "_bad" should not be used. If a label has to be modified, a full copy of the labels should be obtained, operated on, and assigned back to the dataset; otherwise dataset.uniquelabels would not work. The same applies to any other attribute that has a corresponding unique* access property.
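The reason behind this rule can be sketched in plain Python (this is an illustration of the caching pattern, not PyMVPA source code; TinyDataset and its members are hypothetical names):

```python
# Illustrative sketch: a lazily cached unique* property never sees an
# in-place mutation of the underlying labels, so it goes stale.
class TinyDataset:
    def __init__(self, labels):
        self._labels = list(labels)
        self._uniquelabels = None      # lazily computed cache

    @property
    def labels(self):
        return self._labels

    @labels.setter
    def labels(self, value):
        self._labels = list(value)
        self._uniquelabels = None      # assignment invalidates the cache

    @property
    def uniquelabels(self):
        if self._uniquelabels is None:
            self._uniquelabels = sorted(set(self._labels))
        return self._uniquelabels

ds = TinyDataset(['rest', 'task'])
_ = ds.uniquelabels                    # cache now holds ['rest', 'task']
ds.labels[1] += '_bad'                 # WRONG: mutation bypasses the setter
stale = ds.uniquelabels                # still reports ['rest', 'task']

labels = list(ds.labels)               # RIGHT: copy, modify, assign back
labels[1] = 'task_bad'
ds.labels = labels                     # setter invalidates the cache
fresh = ds.uniquelabels                # now ['rest', 'task_bad']
```

The copy-modify-assign pattern at the end is the one the note above prescribes.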

Initialize dataset instance

There are basically two different ways to create a dataset:

  1. Create a new dataset from samples and sample attributes. In this mode a two-dimensional ndarray has to be passed to the samples keyword argument and the corresponding samples attributes are provided via the labels and chunks arguments.

  2. Copy constructor mode

    The second way is used internally to perform quick copying of datasets, e.g. when performing feature selection. In this mode the two dictionaries (data and dsattr) are required. For performance reasons this mode bypasses most of the sanity checks performed by the previous mode, as data integrity is assumed for internal operations.

Parameters:
  • data (dict) – Dictionary with an arbitrary number of entries. The value for each key in the dict has to be an ndarray with the same length as the number of rows in the samples array. A special entry in this dictionary is ‘samples’, a 2d array (samples x features). A shallow copy is stored in the object.
  • dsattr (dict) – Dictionary of dataset attributes. An arbitrary number of arbitrarily named and typed objects can be stored here. A shallow copy of the dictionary is stored in the object.
  • dtype (type | None) – If None – do not change data type if samples is an ndarray. Otherwise convert samples to dtype.
Keywords:
  • samples (ndarray) – 2d array (samples x features)
  • labels – An array or scalar value defining labels for each sample
  • labels_map (None or bool or dict) – Map from labels into literal names. If None or True, the mapping is computed from the labels, which must be literal. If False, no mapping is computed. If a dict, the mapping is verified and used, and the labels get remapped. The dict must map literal -> number.
  • chunks – An array or scalar value defining chunks for each sample

Each of the keyword arguments overwrites what is/might be already in the data container.
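The labels_map semantics above can be sketched in a few lines of plain Python (a hedged illustration of the described behaviour, not PyMVPA internals; compute_labels_map is a hypothetical helper name):

```python
# Sketch of the labels_map semantics: literal labels map to numbers, and a
# user-supplied dict must cover every literal label present in the dataset.
def compute_labels_map(labels, labels_map=None):
    """Return (numeric_labels, literal->number map) for literal labels."""
    if labels_map in (None, True):
        # Compute the mapping from the literal labels themselves.
        labels_map = {lit: num for num, lit in enumerate(sorted(set(labels)))}
    elif labels_map is False:
        return list(labels), None              # no mapping requested
    else:
        # Verify a user-supplied mapping: it must cover all literal labels.
        missing = set(labels) - set(labels_map)
        if missing:
            raise ValueError("labels_map misses labels: %s" % sorted(missing))
    return [labels_map[l] for l in labels], labels_map

numeric, lmap = compute_labels_map(['face', 'house', 'face'])
# numeric -> [0, 1, 0], lmap -> {'face': 0, 'house': 1}
```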

applyMapper(featuresmapper=None, samplesmapper=None, train=True)

Obtain new dataset by applying mappers over features and/or samples.

While a featuresmapper leaves the sample attributes unchanged, since the number of samples in the dataset is invariant, a samplesmapper is also applied to the sample attributes themselves!

Applying a featuresmapper will destroy any feature grouping information.

Parameters:
  • featuresmapper (Mapper) – Mapper to somehow transform each sample’s features
  • samplesmapper (Mapper) – Mapper to transform each feature across samples
  • train (bool) – Flag whether to train the mapper with this dataset before applying it.
TODO: selectFeatures is pretty much applyMapper(featuresmapper=MaskMapper(...))

chunks
Per-sample chunks attribute.
convertFeatureIds2FeatureMask(ids)

Returns a boolean mask with all features in ids selected.

Parameters:
  • ids (list or 1d array) – To be selected features ids.
Return type:

ndarray

Returns:

All selected features are set to True; False otherwise.

convertFeatureMask2FeatureIds(mask)

Returns feature ids corresponding to non-zero elements in the mask.

Parameters:
  • mask (1d ndarray) – Feature mask.
Return type:

ndarray

Returns:

Ids of non-zero (non-False) mask elements.
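The two conversions above are simple enough to sketch with the standard library (equivalent logic only; the actual methods operate on and return numpy arrays):

```python
# Sketch of convertFeatureIds2FeatureMask / convertFeatureMask2FeatureIds:
# feature ids -> boolean mask, and back.
def ids2mask(ids, nfeatures):
    """Boolean mask with the features in `ids` set to True."""
    mask = [False] * nfeatures
    for i in ids:
        mask[i] = True
    return mask

def mask2ids(mask):
    """Ids of non-zero (non-False) mask elements."""
    return [i for i, m in enumerate(mask) if m]

mask = ids2mask([0, 2], nfeatures=4)   # [True, False, True, False]
ids = mask2ids(mask)                   # [0, 2]
```

The two operations are inverses of each other for sorted, duplicate-free ids.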

copy()
Create a copy (clone) of the dataset by fully copying the current one.
defineFeatureGroups(definition)
getLabelsMap()
Stored labels map (if any)
getNFeatures()
Number of features per pattern.
getNSamples()
Currently available number of patterns.
getRandomSamples(nperlabel)

Select a random set of samples.

If ‘nperlabel’ is an integer value, the specified number of samples is randomly chosen from each group of samples sharing a unique label value (total number of selected samples: nperlabel x len(uniquelabels)).

If ‘nperlabel’ is a list, its length has to match the number of unique label values. In this case ‘nperlabel’ specifies the number of samples to be selected from the samples with the corresponding label.

The method returns a Dataset object containing the selected samples.
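The per-label selection logic can be sketched with the standard library (an illustration of the described behaviour that returns sample indices; the real method returns a Dataset, and random_samples_per_label is a hypothetical name):

```python
# Sketch of getRandomSamples: pick n samples at random from each group of
# samples sharing a unique label value.
import random

def random_samples_per_label(labels, nperlabel, rng=random):
    uniq = sorted(set(labels))
    # A scalar means the same count for every label; a list gives one count
    # per unique label and must match len(uniq).
    if isinstance(nperlabel, int):
        counts = [nperlabel] * len(uniq)
    else:
        counts = list(nperlabel)
        assert len(counts) == len(uniq)
    selected = []
    for label, n in zip(uniq, counts):
        ids = [i for i, l in enumerate(labels) if l == label]
        selected.extend(rng.sample(ids, n))
    return sorted(selected)

picked = random_samples_per_label([0, 0, 1, 1, 1], nperlabel=1)
# one index from {0, 1} and one from {2, 3, 4}
```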

idhash

To verify whether the dataset is in the same state as when something else was done with it.

For example, whether a classifier was trained on the same dataset as the one in question.

idsbychunks(x)
Return ids (indices) of the samples with the given chunks value(s).
idsbylabels(x)
Return ids (indices) of the samples with the given labels value(s).
idsonboundaries(prior=0, post=0, attributes_to_track=['labels', 'chunks'], affected_labels=None, revert=False)

Find samples which are on the boundaries of the blocks

Such samples might need to be removed. By default (with prior=0, post=0) the ids of the first samples in each ‘block’ are reported.

Parameters:
  • prior (int) – how many samples prior to transition sample to include
  • post (int) – how many samples post the transition sample to include
  • attributes_to_track (list of basestring) – which attributes to track to decide on the boundary condition
  • affected_labels (list of basestring) – for which labels to perform the selection. If None – for all labels
  • revert (bool) – whether to revert the meaning and provide ids of the samples which are found not to be boundary samples
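The boundary detection described above can be sketched with the standard library (an illustration of the described behaviour, not PyMVPA source; each element of `attrs` stands for the tuple of tracked attribute values of one sample):

```python
# Sketch of idsonboundaries: a sample is on a boundary when any tracked
# attribute changes relative to the previous sample; prior/post widen the
# reported window around each transition.
def ids_on_boundaries(attrs, prior=0, post=0):
    """attrs: list of per-sample attribute tuples; returns boundary ids."""
    n = len(attrs)
    boundaries = set()
    for i in range(n):
        if i == 0 or attrs[i] != attrs[i - 1]:      # first sample of a block
            lo, hi = max(0, i - prior), min(n - 1, i + post)
            boundaries.update(range(lo, hi + 1))
    return sorted(boundaries)

# two chunks of three samples each, only 'chunks' tracked
chunks = [(0,), (0,), (0,), (1,), (1,), (1,)]
first_of_blocks = ids_on_boundaries(chunks)            # [0, 3]
widened = ids_on_boundaries(chunks, prior=1, post=1)   # [0, 1, 2, 3, 4]
```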
index(*args, **kwargs)

Universal indexer to obtain indexes of interesting samples/features. See .select() for more information.

Returns: a tuple of (sample indexes, feature indexes). Either item can also be None if no selection on samples or features was requested (to discriminate between no selected items and no selection at all).
labels
Per-sample labels attribute.
labels_map
Stored labels map (if any)
nfeatures
Number of features per pattern.
nsamples
Currently available number of patterns.
origids
Per-sample original ids.
permuteLabels(status, perchunk=True, assure_permute=False)

Permute the labels.

TODO: rename status into something closer in semantics.

Parameters:
  • status (bool) – If True, the labels are permuted among all samples. If ‘status’ is False, the original labels are restored.
  • perchunk (bool) – If True, permutation is limited to samples sharing the same chunk value. Therefore only the association of a certain sample with a label is permuted, while the absolute number of occurrences of each label value within a certain chunk is kept constant.
  • assure_permute (bool) – If True, assures that the labels actually get permuted, i.e. that they differ from the original ones.
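The perchunk=True behaviour can be sketched with the standard library (an illustration of the described semantics, not PyMVPA source; permute_labels_per_chunk is a hypothetical name):

```python
# Sketch of permuteLabels(perchunk=True): labels are shuffled only within
# each chunk, so the per-chunk label counts stay constant.
import random
from collections import Counter

def permute_labels_per_chunk(labels, chunks, rng=random):
    permuted = list(labels)
    for chunk in set(chunks):
        ids = [i for i, c in enumerate(chunks) if c == chunk]
        chunk_labels = [labels[i] for i in ids]
        rng.shuffle(chunk_labels)
        for i, lab in zip(ids, chunk_labels):
            permuted[i] = lab
    return permuted

labels = [0, 0, 1, 1, 0, 0, 1, 1]
chunks = [0, 0, 0, 0, 1, 1, 1, 1]
new = permute_labels_per_chunk(labels, chunks)
# within each chunk the label counts are unchanged
```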
samples
2d samples array (samples x features).
samplesperchunk
Number of samples per unique chunks value.
samplesperlabel
Number of samples per unique label value.
select(*args, **kwargs)

Universal selector

WARNING: if you need to select duplicate samples (e.g. samples=[5,5]), or if the order of the selected samples or features matters and must not be sorted (e.g. samples=[3,2,1]), please use the selectFeatures or selectSamples functions directly.

Examples:

Mimic plain selectSamples:

dataset.select([1,2,3])
dataset[[1,2,3]]

Mimic plain selectFeatures:

dataset.select(slice(None), [1,2,3])
dataset.select('all', [1,2,3])
dataset[:, [1,2,3]]

Mixed (select features and samples):

dataset.select([1,2,3], [1, 2])
dataset[[1,2,3], [1, 2]]

Select samples matching some attributes:

dataset.select(labels=[1,2], chunks=[2,4])
dataset.select('labels', [1,2], 'chunks', [2,4])
dataset['labels', [1,2], 'chunks', [2,4]]

Mixed – out of first 100 samples, select only those with labels 1 or 2 and belonging to chunks 2 or 4, and select features 2 and 3:

dataset.select(slice(0,100), [2,3], labels=[1,2], chunks=[2,4])
dataset[:100, [2,3], 'labels', [1,2], 'chunks', [2,4]]
selectFeatures(ids=None, sort=True, groups=None)

Select a number of features from the current set.

Parameters:
  • ids – iterable container to select ids
  • sort (bool) – whether to sort the ids. Order matters: selectFeatures assumes incremental order. If that is not the case, non-optimized code verifies the order and sorts.

Returns a new Dataset object with a view of the original samples array (no copying is performed).

WARNING: The order of ids determines the order of features in the returned dataset. This might be useful sometimes, but can also cause major headaches! The order is verified when running non-optimized code (if __debug__).

selectSamples(ids)

Choose a subset of samples defined by samples IDs.

Returns a new dataset object containing the selected sample subset.

TODO: yoh, we might need to sort the mask if the mask is a list of ids and is not ordered. Clarify with Michael what is our intent here!

setLabelsMap(lm)

Set labels map.

Checks the validity of the mapping – its values should cover all existing labels in the dataset.

setSamplesDType(dtype)
Set the data type of the samples array.
summary(uniq=True, stats=True, idhash=False, lstats=True, maxc=30, maxl=20)

String summary over the object

Parameters:
  • uniq (bool) – Include a summary over the data attributes which have a corresponding unique* access property
  • idhash (bool) – Include idhash value for dataset and samples
  • stats (bool) – Include some basic statistics (mean, std, var) over dataset samples
  • lstats (bool) – Include statistics on chunks/labels
  • maxc (int) – Maximal number of chunks when providing details on labels/chunks
  • maxl (int) – Maximal number of labels when providing details on labels/chunks
summary_labels(maxc=30, maxl=20)

Provide summary statistics over the labels and chunks

Parameters:
  • maxc (int) – Maximal number of chunks when providing details
  • maxl (int) – Maximal number of labels when providing details
uniquechunks
Unique values of the chunks attribute.
uniquelabels
Unique values of the labels attribute.
where(*args, **kwargs)

Obtain indexes of interesting samples/features. See select() for more information

XXX somewhat obsoletes idsby...

See also

Derived classes might provide additional methods via their base classes. Please refer to the list of base classes (if it exists) at the beginning of the Dataset documentation.

Full API documentation of Dataset in module mvpa.datasets.base.

mvpa.datasets.base.datasetmethod(func)
Decorator to easily bind functions to a Dataset class
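Such a decorator can be sketched in a few lines (assumed semantics based on the description above, not the PyMVPA source; the stub Dataset class and describe function are illustrative):

```python
# Sketch of a datasetmethod-style decorator: attach a free function to a
# class as a method under the function's own name.
class Dataset:
    pass

def datasetmethod(func):
    """Bind `func` to Dataset under its own name and return it unchanged."""
    setattr(Dataset, func.__name__, func)
    return func

@datasetmethod
def describe(self):
    return "a %s instance" % self.__class__.__name__

ds = Dataset()
text = ds.describe()   # -> "a Dataset instance"
```

Returning the function unchanged keeps it usable both as a plain function and as a bound method.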

See also

Full API documentation of datasetmethod() in module mvpa.datasets.base.