Working with large datasets in Pylearn2

By default, Pylearn2 loads the entire dataset into main memory (not GPU memory). This can be problematic for large datasets. There are several Python/NumPy solutions for dealing with large data, such as numpy.memmap, PyTables, and h5py.

Pylearn2 currently supports only PyTables and h5py. (memmap support has been introduced in the latest version of Theano, but it has not been tested with Pylearn2 yet.)

PyTables

pylearn2.datasets.dense_design_matrix.DenseDesignMatrixPyTables is designed to mimic the behaviour of DenseDesignMatrix, but underneath it stores the data on disk in the PyTables hdf5 file format. pylearn2.datasets.svhn.SVHN is a good example of how to make a DenseDesignMatrixPyTables object and store your data in it.
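Below is a minimal sketch of that pattern. It is not taken verbatim from the library: the assumption that the DenseDesignMatrixPyTables constructor accepts PyTables array nodes for X and y follows the usage in pylearn2.datasets.svhn, and the file path, group name, and toy data are made up for illustration.

    import numpy as np
    import tables
    from pylearn2.datasets.dense_design_matrix import DenseDesignMatrixPyTables

    path = 'train.h5'                       # hypothetical output file
    n, d, n_classes = 1000, 32 * 32 * 3, 10

    # Create the hdf5 file and pre-allocate on-disk arrays for the design
    # matrix X and the one-hot targets y (group/array names are arbitrary).
    h5file = tables.open_file(path, mode='w', title='toy dataset')
    group = h5file.create_group(h5file.root, 'Data', 'Data')
    atom = tables.Float32Atom()
    x_node = h5file.create_carray(group, 'X', atom=atom, shape=(n, d))
    y_node = h5file.create_carray(group, 'y', atom=atom, shape=(n, n_classes))

    # Write the data in mini-batches rather than all at once (see "Known issues").
    batch_size = 100
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        x_node[start:stop] = np.random.rand(stop - start, d).astype('float32')
        labels = np.random.randint(0, n_classes, stop - start)
        y_node[start:stop] = np.eye(n_classes, dtype='float32')[labels]
    h5file.flush()

    # Wrap the on-disk arrays; batches are read lazily from the file during
    # iteration instead of the whole matrix being loaded into main memory.
    dataset = DenseDesignMatrixPyTables(X=h5file.root.Data.X, y=h5file.root.Data.y)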

h5py

If your data is already saved in hdf5 format, you can use the pylearn2.datasets.hdf5.HDF5Dataset class to access it in Pylearn2. For an example of how to save data in hdf5 format and load it with HDF5Dataset, take a look at pylearn2.datasets.tests.test_hdf5.TestHDF5Dataset.
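A rough sketch of that workflow is shown below, modeled on the test. The keyword arguments filename, X, y (the names of the datasets inside the file), and load_all are assumptions based on that test and may differ in other Pylearn2 versions.

    import h5py
    import numpy as np
    from pylearn2.datasets.hdf5 import HDF5Dataset

    # Save a design matrix X and one-hot targets y to an hdf5 file with h5py.
    rng = np.random.RandomState(0)
    with h5py.File('train.h5', 'w') as f:
        f.create_dataset('X', data=rng.rand(1000, 784).astype('float32'))
        f.create_dataset('y', data=rng.randint(0, 2, (1000, 10)).astype('float32'))

    # Point HDF5Dataset at the file; 'X' and 'y' are the names of the datasets
    # inside the file. With load_all=False the data stays on disk and is read
    # lazily; load_all=True copies everything into main memory.
    dataset = HDF5Dataset(filename='train.h5', X='X', y='y', load_all=False)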

PyTables vs. h5py

Each project publishes its own comparison of the two libraries; in brief:

One advantage of h5py over PyTables is that it can work with hdf5 files created by other libraries, whereas PyTables hdf5 files are not standard. PyTables, on the other hand, adds some performance-enhancing features and supports LZO and bzip2 compression in addition to zlib (h5py supports gzip and LZF out of the box).

Known issues

  • Both hdf5-based solutions are known to crash when the data is accessed in random order. To avoid this issue, use either the ‘sequential’ or the ‘batchwise_shuffled_sequential’ iteration scheme (see the sketch after this list).
  • Writing a large amount of data to hdf5 at once is known to cause crashes, so it is advised to write the data to the file in mini-batches. Some of the preprocessing functions have a mini-batch option, but not all of them.
  • Users should be aware that any changes made to the data will also be written to the file on disk (except when HDF5Dataset is used with load_all=True).
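As a sketch of how to request one of these schemes, the snippet below asks the dataset's iterator for a non-random mode; it assumes the HDF5Dataset created in the h5py example above and the standard Dataset.iterator(mode=..., batch_size=...) interface.

    from pylearn2.datasets.hdf5 import HDF5Dataset

    # Reopen the hdf5-backed dataset created in the h5py example.
    dataset = HDF5Dataset(filename='train.h5', X='X', y='y', load_all=False)

    # Request one of the recommended schemes instead of random access.
    it = dataset.iterator(mode='batchwise_shuffled_sequential', batch_size=128)
    for X_batch in it:
        pass  # each X_batch is a contiguous mini-batch read from the file

When training through a YAML file, the same schemes can typically be selected via the training algorithm's iteration-mode parameters (e.g. SGD's train_iteration_mode and monitor_iteration_mode).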