Using HDF5 to serialize pandas DataFrames

Pickling was the default method of saving DataFrames in pandas before version 0.12, but pickling has issues and offers no compression. As an example, a month's worth of annotated buoy displacement data comes to 256 MB pickled.
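To see the baseline, you can measure the pickled size of a frame directly. This is a minimal sketch with synthetic stand-in data; the column name and sizes are illustrative, not the actual buoy dataset:

```python
import pickle

import numpy as np
import pandas as pd

# Hypothetical stand-in for a month of buoy displacement data.
rng = np.random.default_rng(0)
idx = pd.date_range("2013-01-01", periods=1000, freq="min")
df = pd.DataFrame({"displacement": rng.normal(size=len(idx))}, index=idx)

# Pickle applies no compression, so the serialized size closely
# tracks the in-memory size of the DataFrame.
pickled_size = len(pickle.dumps(df))
```

The uncompressed pickle carries the full float64 payload of both the index and the data column, which is why the on-disk numbers quoted here match the in-memory footprint.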

buoy_displacements_dataframe.to_hdf('buoy_data', 'displacements', format='table', append=True, complib='blosc', complevel=9)

Using the maximum compression level with the blosc compression library, that comes down to 109 MB. Appending another month to an existing HDF5 file yields even bigger disk-space savings: two months' worth of data that would have been 512 MB pickled, or 204 MB in separate HDF5 files, becomes 189 MB in a single appended file.
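The append workflow described above can be sketched end to end. This example uses synthetic month-long frames and a temporary file (the names, row counts, and path are assumptions for illustration); it requires PyTables (`tables`) to be installed, as `to_hdf` with `format='table'` does:

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def month_of_data(start):
    # Hypothetical one-month frame of displacement readings.
    idx = pd.date_range(start, periods=5000, freq="min")
    return pd.DataFrame({"displacement": rng.normal(size=len(idx))}, index=idx)

jan = month_of_data("2013-01-01")
feb = month_of_data("2013-02-01")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "buoy_data.h5")
    # First call creates the table; the second appends February's
    # rows to the same compressed table on disk.
    jan.to_hdf(path, key="displacements", format="table", append=True,
               complib="blosc", complevel=9)
    feb.to_hdf(path, key="displacements", format="table", append=True,
               complib="blosc", complevel=9)
    total_rows = len(pd.read_hdf(path, "displacements"))
```

Appending into one file lets the compressor and the shared table metadata amortize across months, which is where the extra savings over separate files come from.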

Although the disk size is much smaller, the DataFrames return to their original size when read into memory, so memory becomes the limiting factor. Querying the HDF5 file therefore becomes important. Querying significantly increases load time, though queries appear to become faster over repeated runs, presumably because an index is built.

import pandas as pd

buoy_data_hdf = pd.HDFStore('buoy_data')

buoy_data_hdf.select('displacements', where="index > pd.Timestamp('2013-02-15')")
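A self-contained version of that query, using synthetic data and a temporary file (both are assumptions for illustration), shows how `select` pulls back only the rows matching the `where` condition rather than loading the whole table into memory; it also requires PyTables:

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2013-02-01", periods=4000, freq="h")
df = pd.DataFrame({"displacement": rng.normal(size=len(idx))}, index=idx)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "buoy_data.h5")
    df.to_hdf(path, key="displacements", format="table",
              complib="blosc", complevel=9)
    # Opening the store as a context manager ensures it is closed.
    with pd.HDFStore(path) as store:
        # Only rows whose index passes the condition are read from disk.
        recent = store.select("displacements",
                              where="index > pd.Timestamp('2013-02-15')")

n_recent = len(recent)
```

Queries against the indexed datetime column are exactly what `format='table'` storage is designed for; a plain fixed-format store would not support `select` with a `where` clause at all.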