Data specifications, spaces, and sources¶
Data specifications, often called data_specs, describe how data is
requested and provided, and in which format, across different parts of
a Pylearn2 experiment.
A data_specs is a (space, source) pair, where source is an
identifier of the data source or sources required (for instance, inputs
and targets), and space is an instance of space.Space
representing the format of these data (for instance, a vector, a 3D
tensor representing an RGB image, or a one-hot vector).
The main use of data_specs is to request data from a
datasets.Dataset object, via an iterator. Various objects can
request data this way: models, costs, monitoring channels, training
algorithms, even some datasets that perform a transformation on data.
The Space object¶
A Space represents a way in which a mini-batch of data can be
formatted. For instance, a batch of RGB images (each of shape (rows,
columns)) can be represented in different ways:
- as a matrix where each row corresponds to a different image and is of length rows * columns * 3: the corresponding space would be a space.VectorSpace, more precisely VectorSpace(dim=(rows * columns * 3));
- as a 4-dimensional tensor, where rows, columns, and channels (here: red, green, and blue) are different axes: the corresponding space would be a space.Conv2DSpace. Theano convolutions prefer that tensor to have shape (batch_size, channels, rows, columns), which corresponds to Conv2DSpace(shape=(rows, columns), num_channels=3, axes=('b', 'c', 0, 1));
- as a 4-dimensional tensor with a different shape: for instance, cuda-convnet prefers (channels, rows, columns, batch_size), in which case the space would be Conv2DSpace(shape=(rows, columns), num_channels=3, axes=('c', 0, 1, 'b')).
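The three layouts above differ only in how the same numbers are arranged. The following is a minimal sketch in plain Python, not Pylearn2 code: batch_shape is a hypothetical helper, and the batch size of 128 and image size of 32 x 32 are assumptions chosen for illustration. It computes the shape a mini-batch would have under each layout.

```python
def batch_shape(axes, batch_size, num_channels, rows, columns):
    """Return the shape of a 4-D batch laid out in the given axis order.

    Axis labels follow the convention quoted above: 'b' is the batch
    axis, 'c' the channel axis, 0 the row axis and 1 the column axis.
    """
    sizes = {'b': batch_size, 'c': num_channels, 0: rows, 1: columns}
    return tuple(sizes[axis] for axis in axes)


# Assumed sizes, for illustration only.
batch_size, rows, columns = 128, 32, 32

# Flat layout: one row per image, all pixels and channels concatenated.
vector_shape = (batch_size, rows * columns * 3)

# Layout preferred by Theano convolutions.
theano_shape = batch_shape(('b', 'c', 0, 1), batch_size, 3, rows, columns)

# Layout preferred by cuda-convnet.
cuda_convnet_shape = batch_shape(('c', 0, 1, 'b'), batch_size, 3, rows, columns)

print(vector_shape)        # (128, 3072)
print(theano_shape)        # (128, 3, 32, 32)
print(cuda_convnet_shape)  # (3, 32, 32, 128)
```

All three shapes hold the same 128 * 3072 values, which is why a batch can be converted from one layout to another.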
Spaces can be either elementary, representing one mini-batch from one
source of data, such as VectorSpace and Conv2DSpace mentioned
above, or composite (space.CompositeSpace), representing
the aggregation of several sources of data (some of these may in
turn be aggregations of sources). A mini-batch for an elementary
space will usually be a NumPy ndarray, whereas a mini-batch for a
CompositeSpace will be a Python tuple of elementary (or composite)
mini-batches.
Notable methods of the Space class are:
- Space.make_theano_batch()¶
creates a Theano Variable (or a tuple of Theano Variables in the case of CompositeSpace) representing a symbolic mini-batch of data. For instance, VectorSpace(...).make_theano_batch(...) will essentially call theano.tensor.matrix().
- Space.validate(batch)¶
checks that the symbolic variable batch can correctly represent a mini-batch of data for the corresponding space. For instance, VectorSpace(...).validate(theano.tensor.matrix()) will work, but VectorSpace(...).validate(theano.tensor.vector()) will raise an exception.
- Space.np_validate(batch)¶
(np stands for NumPy) is similar, but operates on a mini-batch of numeric data rather than on a symbolic variable. This enables more checks to be performed. For instance, VectorSpace(dim=3).np_validate(np.zeros((4, 3))) will work, because the array correctly describes a mini-batch of 4 samples of dimension 3, but VectorSpace(dim=4).np_validate(np.zeros((4, 3))) will raise an exception.
- Space.format_as(batch, space)¶
and
- Space.np_format_as(batch, space)¶
convert data from their original space into the destination space. format_as operates on a symbolic batch and returns a symbolic expression of the newly formatted data, whereas np_format_as operates on a numeric batch and returns numeric data. This formatting can happen between different instances of the same Space class: for instance, converting between two instances of Conv2DSpace with different axes amounts to transposing the batch correctly. It can also happen between different subclasses of Space: for instance, converting between a VectorSpace and a Conv2DSpace of compatible shape involves reshaping and transposing the data.
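To illustrate the transposition involved when two Conv2DSpace instances differ only in their axes, here is a minimal sketch in plain Python. It does not call Pylearn2: transpose_permutation and permute_shape are hypothetical helpers, and the batch shape is an assumption. The sketch only derives the axis permutation and applies it to a shape tuple.

```python
def transpose_permutation(src_axes, dst_axes):
    """Return the permutation that reorders src_axes into dst_axes.

    Entry i of the result is the position, in the source layout, of the
    axis that ends up at position i in the destination layout.
    """
    if sorted(map(str, src_axes)) != sorted(map(str, dst_axes)):
        raise ValueError("both layouts must name the same axes")
    return tuple(src_axes.index(axis) for axis in dst_axes)


def permute_shape(shape, permutation):
    """Apply an axis permutation to a shape tuple."""
    return tuple(shape[i] for i in permutation)


src_axes = ('b', 'c', 0, 1)   # layout preferred by Theano convolutions
dst_axes = ('c', 0, 1, 'b')   # layout preferred by cuda-convnet

perm = transpose_permutation(src_axes, dst_axes)
print(perm)  # (1, 2, 3, 0)

# Assumed batch: 128 RGB images of 32 x 32 pixels.
src_shape = (128, 3, 32, 32)
print(permute_shape(src_shape, perm))  # (3, 32, 32, 128)
```

For a numeric batch stored as a NumPy array, the same permutation could be passed to numpy.transpose, which uses the same convention for its axes argument.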
Sources¶
Sources are simple identifiers that specify which data should be returned, whereas spaces specify how that data should be formatted.
An elementary source is identified by a Python string. For instance, the
most commonly used sources are 'features' and 'targets'. 'features'
usually denotes the part of the data that models will use as input, and
'targets', for labeled datasets, contains the value the model will
try to predict. However, this is only a convention, and some datasets
will declare other sources that models can use in varying ways,
for instance when using multi-modal data.
A composite source is identified by a tuple of sources. For instance,
to request features and targets from a dataset, the source would be
('features', 'targets').
Structure of data specifications¶
When using data specifications data_specs=(space, source),
space and source have to have the same structure. This means
that:
- if space is an elementary space, then source has to be an elementary source, i.e., a string;
- if space is a composite space, then source has to be a composite source (a tuple), with exactly as many components as the number of sub-spaces of space; and the corresponding sub-sources and sub-spaces again have to have the same structure.
For example, let us define the following spaces:
from pylearn2.space import CompositeSpace, Conv2DSpace, VectorSpace

input_vecspace = VectorSpace(dim=(32 * 32 * 3))
input_convspace = Conv2DSpace(shape=(32, 32), num_channels=3,
                              axes=('b', 'c', 0, 1))
target_space = VectorSpace(dim=10)
and suppose "features" and "targets" are sources present in our
data. Then, the following data_specs are correct:
- (input_vecspace, "features"): only the features, mini-batches will be matrices;
- (input_convspace, "features"): only the features, mini-batches will be 4-D tensors;
- (target_space, "targets"): only the targets, mini-batches will be matrices;
- (CompositeSpace((input_vecspace, target_space)), ("features", "targets")): features and targets, in that order; mini-batches will be (matrix, matrix) pairs;
- (CompositeSpace((target_space, input_convspace)), ("targets", "features")): targets and features, in that order; mini-batches will be (matrix, 4-D tensor) pairs;
- (CompositeSpace((input_vecspace, input_vecspace, input_vecspace, target_space)), ("features", "features", "features", "targets")): features repeated 3 times, then targets; mini-batches will be (matrix, matrix, matrix, matrix) tuples;
- (CompositeSpace((CompositeSpace((input_vecspace, input_vecspace, input_vecspace)), target_space)), (("features", "features", "features"), "targets")): same as above, but the repeated features are in another CompositeSpace; mini-batches will be ((matrix, matrix, matrix), matrix) pairs with the first element being a triplet.
The following ones are incorrect:
- (target_space, "features"): it will not crash immediately, but as soon as actual data are used, it will crash because feature data will have a width of 32 * 32 * 3 = 3072, but target_space.dim is 10;
- (CompositeSpace((input_vecspace, input_convspace)), "features"): the source part has to have as many elements as there are sub-spaces of the CompositeSpace, but "features" is not a pair. You would need to write (CompositeSpace((input_vecspace, input_convspace)), ("features", "features"));
- (CompositeSpace((input_vecspace,)), "features"): the source part should be a tuple of length 1, not a string. You would need to write (CompositeSpace((input_vecspace,)), ("features",));
- (CompositeSpace((input_vecspace, input_vecspace, input_vecspace, target_space)), (("features", "features", "features"), "targets")): even if the total numbers of elementary spaces and elementary sources match, their structures do not: the sub-spaces are in a flat tuple of length 4, whereas the sources are in a nested tuple;
- (CompositeSpace((CompositeSpace((input_vecspace, input_vecspace, input_vecspace)), target_space)), ("features", "features", "features", "targets")): it is the same problem, the other way around.
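The structural rule of this section can be sketched as a short recursive check. This is an illustration in plain Python and not Pylearn2's own validation code: same_structure is a hypothetical helper, composite spaces are modeled as plain tuples, and elementary spaces are modeled as placeholder strings such as "vec" or "tgt".

```python
def same_structure(space, source):
    """Check that a stand-in space and a source have the same nesting.

    Tuples model composite spaces and composite sources; anything else
    is treated as elementary, and an elementary source must be a string.
    """
    space_is_composite = isinstance(space, tuple)
    source_is_composite = isinstance(source, tuple)
    if space_is_composite != source_is_composite:
        return False
    if not space_is_composite:
        return isinstance(source, str)
    if len(space) != len(source):
        return False
    return all(same_structure(sub_space, sub_source)
               for sub_space, sub_source in zip(space, source))


# Correct: elementary space with elementary source.
print(same_structure("vec", "features"))                      # True
# Incorrect: composite space with a string source.
print(same_structure(("vec", "conv"), "features"))            # False
# Correct: a composite of length 1 needs a tuple of length 1.
print(same_structure(("vec",), ("features",)))                # True
# Incorrect: flat space of length 4, nested source of length 2.
print(same_structure(("vec", "vec", "vec", "tgt"),
                     (("features",) * 3, "targets")))         # False
# Correct: nesting matches on both sides.
print(same_structure((("vec", "vec", "vec"), "tgt"),
                     (("features",) * 3, "targets")))         # True
```

Note that this check looks only at structure. It would not catch the first incorrect example above, where the structure matches but the dimension of the space does not match the data.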
Examples of use¶
Here are some examples of how data specifications are currently used in different Pylearn2 objects.
The big picture¶
The TrainingAlgorithm object (for instance DefaultTrainingAlgorithm,
or SGD) is usually the one that requests the data_specs from the
various objects defined in an experiment script (model, costs, monitor
channels). It combines them into one nested data_specs, flattens it,
requests iterators from the datasets, and iterates over the dataset,
converting the flat data back into the nested structure so that it
can be correctly dispatched to all the objects requiring data.
Input of a model¶
A Model object used in an experiment has to declare its input
source and space, so the right data will be provided to it by
the dataset iterator, in the appropriate format. This is done
by the methods models.Model.get_input_source() and
models.Model.get_input_space().
By default, most models will simply use "features" as input source,
but that could be changed for an experiment where the user wants to
apply the model to a different source of the dataset, or to a dataset
where sources are named differently.
Models that do not depend on the topology of the input will use a
VectorSpace as input space, whereas convolutional models, for
instance, will use an instance of Conv2DSpace.
Models also declare an output space, which can be useful for the cost, for instance, or for other objects that can use or embed a model.
Input of a cost¶
A Cost object needs to implement the
costs.Cost.get_data_specs(self, model) method, which will
be used to determine which data (and format) will be passed as the
data argument of costs.Cost.expr(self, model, data) and
costs.Cost.get_gradients(self, model, data).
Example 1: cost without data¶
For instance, a cost that does not depend on data at all, but only on
the model parameters, like an L1 regularization penalty, would typically
use (NullSpace(), '') for data specifications, and expr would be
passed data=None.
Example 2: unsupervised cost¶
An unsupervised cost, which uses only unlabeled features and
not targets, will usually use (model.get_input_space(),
model.get_input_source()), so the data passed to expr will
be directly usable by the model.
Example 3: supervised cost¶
Finally, a supervised cost, needing both features and targets, will usually request the targets to be in the same space as the model’s predictions (the model’s output space):
def get_data_specs(self, model):
    return (CompositeSpace((model.get_input_space(),
                            model.get_output_space())),
            (model.get_input_source(),
             "targets"))
Then, data would be a pair, the first element of which can be passed
directly to the model.
Of course, it does not have to be implemented that way, and the
following is just as correct (if more confusing) if you prefer having
data be a (targets, inputs) pair instead:
def get_data_specs(self, model):
    return (CompositeSpace((model.get_output_space(),
                            model.get_input_space())),
            ("targets",
             model.get_input_source()))
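To make the link between get_data_specs and expr concrete, here is a toy sketch in plain Python. Nothing in it is Pylearn2 code: ToyModel and ToySquaredError are hypothetical classes, spaces are replaced by placeholder strings, a plain tuple stands in for CompositeSpace, and the batch is an ordinary pair of lists. It only shows that the order declared in get_data_specs is the order in which expr unpacks data.

```python
class ToyModel(object):
    """Hypothetical stand-in for a model; spaces are plain strings here."""

    def get_input_space(self):
        return "input_space"

    def get_output_space(self):
        return "output_space"

    def get_input_source(self):
        return "features"

    def predict(self, inputs):
        # Toy prediction rule: double every input value.
        return [2.0 * x for x in inputs]


class ToySquaredError(object):
    """Hypothetical supervised cost requesting (inputs, targets)."""

    def get_data_specs(self, model):
        # A plain tuple stands in for CompositeSpace in this sketch.
        space = (model.get_input_space(), model.get_output_space())
        source = (model.get_input_source(), "targets")
        return (space, source)

    def expr(self, model, data):
        # data follows the order declared in get_data_specs.
        inputs, targets = data
        predictions = model.predict(inputs)
        squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
        return sum(squared) / len(squared)


model = ToyModel()
cost = ToySquaredError()
space, source = cost.get_data_specs(model)
print(source)  # ('features', 'targets')

# A batch formatted according to the declared specs: (inputs, targets).
batch = ([1.0, 2.0, 3.0], [2.0, 4.0, 7.0])
value = cost.expr(model, batch)
print(value)
```

Swapping the two components in get_data_specs would require swapping the unpacking in expr as well, which is why the (targets, inputs) variant is correct but easier to get wrong.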
Input of a monitoring channel¶
As with costs used for training, variables monitored by MonitorChannels have to declare data specs corresponding to the input variables necessary to compute the monitored value. These data specs are passed directly to the constructor, for instance, when calling:
channel = MonitorChannel(
    graph_inputs=input_variables,
    val=monitored_value,
    name='channel_name',
    data_specs=data_specs,
    dataset=dataset)
data_specs describes the format and semantics of input_variables.
As in the previous section, if val does not need any input data,
for instance if it is a shared variable, data_specs will be
(NullSpace(), ''). If val corresponds to an unsupervised cost,
or to a quantity depending only on the "features" source, data_specs
could be (VectorSpace(...), "features"), etc.
For monitored values defined in
models.Model.get_monitoring_channels(self, data), the
data_specs of data, which are also the data_specs to
pass to MonitorChannel's constructor, are returned by a call to
models.Model.get_monitoring_channels_data(self).
Nesting and flattening data_specs¶
In order to avoid duplicating data and creating lots of symbolic inputs to Theano functions (which also do not support nested arguments), it can be useful to convert a nested, composite data_specs into a flat, non-redundant one. That flat data_specs can be used to create Theano variables or to get mini-batches of data, for instance, which are then nested back into the original structure of the data_specs.
We use the utils.data_specs.DataSpecsMapping class to build a
mapping between the original, nested data specs, and the flat one.
For instance, using the spaces defined earlier:
from pylearn2.utils.data_specs import DataSpecsMapping

source = ("features", ("features", "targets"))
space = CompositeSpace((input_vecspace,
                        CompositeSpace((input_convspace,
                                        target_space))))
mapping = DataSpecsMapping((space, source))
flat_source = mapping.flatten(source)
# flat_source == ('features', 'features', 'targets')
flat_space = mapping.flatten(space)
# flat_space == (input_vecspace, input_convspace, target_space)
# We can use the mapping the other way around:
# nesting the flat version gives back the original structure
nested_source = mapping.nest(flat_source)
assert source == nested_source
nested_space = mapping.nest(flat_space)
assert space == nested_space
# We can also nest other things
print(mapping.nest((1, 2, 3)))
# (1, (2, 3))
Here, 'features' appears twice in the flat source because
the corresponding spaces are different. However, if there is an actual
duplicate, it gets removed:
source = (("features", "targets"), ("features", "targets"))
space = CompositeSpace((CompositeSpace((input_vecspace, target_space)),
                        CompositeSpace((input_vecspace, target_space))))
mapping = DataSpecsMapping((space, source))
flat_source = mapping.flatten(source)
# flat_source == ('features', 'targets')
flat_space = mapping.flatten(space)
# flat_space == (input_vecspace, target_space)
# We can use the mapping the other way around:
# nesting the flat version gives back the original structure
nested_source = mapping.nest(flat_source)
assert source == nested_source
nested_space = mapping.nest(flat_space)
assert space == nested_space
# We can also nest other things
print(mapping.nest((1, 2)))
# ((1, 2), (1, 2))
The flat tuple of spaces can be used to create non-redundant Theano input variables, which will be nested back to be dispatched between the different components having requested them:
# From the block above:
# flat_space == (input_vecspace, target_space)
flat_composite_space = CompositeSpace(flat_space)
flat_inputs = flat_composite_space.make_theano_batch(name='input')
print(flat_inputs)
# (input[0], input[1])
# We can use the mapping to nest the Theano variables
nested_inputs = mapping.nest(flat_inputs)
print(nested_inputs)
# ((input[0], input[1]), (input[0], input[1]))
# Then, we can build expressions from these input variables.
# Finally, a Theano function will be compiled with
f = theano.function(flat_inputs, outputs, ...)
# A dataset iterator can also be created from the flat composite space
it = my_dataset.iterator(..., data_specs=(flat_composite_space, flat_source))
# When it is time to call f on data, we can then do
for flat_data in it:
    out = f(*flat_data)
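The flattening and nesting behaviour described in this section can be sketched in plain Python. The following is not the DataSpecsMapping implementation: ToyMapping is a hypothetical class, composite spaces are modeled as tuples, and elementary spaces as placeholder strings. It records, for each leaf of the nested specs, the index of its (space, source) pair in a duplicate-free flat list, and reuses those indices to flatten or nest any object with the same structure.

```python
class ToyMapping(object):
    """Toy mapping between nested data specs and a flat, duplicate-free form."""

    def __init__(self, data_specs):
        space, source = data_specs
        self.flat_pairs = []  # unique (space, source) pairs, in order seen
        self.nested_indices = self._index(space, source, {})

    def _index(self, space, source, seen):
        # Mirror the nesting of the specs, replacing each leaf by its flat index.
        if isinstance(source, tuple):
            return tuple(self._index(sub_space, sub_source, seen)
                         for sub_space, sub_source in zip(space, source))
        pair = (space, source)
        if pair not in seen:
            seen[pair] = len(self.flat_pairs)
            self.flat_pairs.append(pair)
        return seen[pair]

    def flatten(self, nested):
        """Flatten an object that has the same structure as the specs."""
        flat = [None] * len(self.flat_pairs)
        self._fill(nested, self.nested_indices, flat)
        return tuple(flat)

    def _fill(self, nested, indices, flat):
        if isinstance(indices, tuple):
            for sub_nested, sub_indices in zip(nested, indices):
                self._fill(sub_nested, sub_indices, flat)
        else:
            flat[indices] = nested

    def nest(self, flat):
        """Rebuild the nested structure from a flat tuple."""
        return self._nest(flat, self.nested_indices)

    def _nest(self, flat, indices):
        if isinstance(indices, tuple):
            return tuple(self._nest(flat, i) for i in indices)
        return flat[indices]


# Same source twice, but with different spaces: nothing is merged.
mapping_a = ToyMapping((("vec", ("conv", "tgt")),
                        ("features", ("features", "targets"))))
print(mapping_a.flatten(("features", ("features", "targets"))))
# ('features', 'features', 'targets')
print(mapping_a.nest((1, 2, 3)))  # (1, (2, 3))

# Exact duplicates (same space and same source) are merged.
source_b = (("features", "targets"), ("features", "targets"))
mapping_b = ToyMapping(((("vec", "tgt"), ("vec", "tgt")), source_b))
print(mapping_b.flatten(source_b))  # ('features', 'targets')
print(mapping_b.nest((1, 2)))       # ((1, 2), (1, 2))
```

Keying the deduplication on the (space, source) pair rather than on the source alone is what keeps the two 'features' entries separate in the first case.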