Using HDF5 to serialize pandas DataFrames

Pickling was the default method of saving DataFrames in versions of pandas before 0.12, but pickling has issues and doesn't offer compression. As an example, a month's worth of annotated buoy displacement data comes to 256 MB.
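For comparison, a minimal sketch of the pickle route (the frame shape, column name, and file name here are illustrative stand-ins, not the actual buoy dataset):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for a month of buoy displacement samples
# (column and file names are assumptions, not from the original data)
df = pd.DataFrame(
    {"displacement": np.random.randn(1_000)},
    index=pd.date_range("2013-01-01", periods=1_000, freq="min"),
)

# Pickling is simple, but the frame is stored uncompressed
df.to_pickle("buoy_data.pkl")
restored = pd.read_pickle("buoy_data.pkl")
```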

buoy_displacements_dataframe.to_hdf('buoy_data', 'displacements', format='table', append=True, complib='blosc', complevel=9)

With maximum compression using the blosc library, that comes down to 109 MB. Appending another month to an existing HDF5 file saves even more disk space: two months' worth of data that would have been 512 MB pickled, or 204 MB as two separate HDF5 files, become 189 MB in a single appended file.
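The append workflow can be sketched as follows (requires PyTables; the file name, sizes, and column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Two illustrative months of hourly samples (names and sizes are assumptions)
jan = pd.DataFrame(
    {"displacement": np.random.randn(720)},
    index=pd.date_range("2013-01-01", periods=720, freq="h"),
)
feb = pd.DataFrame(
    {"displacement": np.random.randn(672)},
    index=pd.date_range("2013-02-01", periods=672, freq="h"),
)

# format='table' allows appending; blosc at level 9 is maximum compression
jan.to_hdf("buoy_months.h5", key="displacements", format="table",
           append=True, complib="blosc", complevel=9)
feb.to_hdf("buoy_months.h5", key="displacements", format="table",
           append=True, complib="blosc", complevel=9)

combined = pd.read_hdf("buoy_months.h5", "displacements")
```

Both months now share one compressed file, which is where the extra savings over two separate files come from.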

Although the files are much smaller on disk, DataFrames return to their original size when read into memory, so memory becomes the limiting factor. Querying the HDF5 file therefore becomes important. Queries do significantly increase load time, but they appear to get faster over repeated runs, I believe because an index is being built.

import pandas as pd

buoy_data_hdf = pd.HDFStore('buoy_data')

buoy_data_hdf.select('displacements', where='index > "2013-02-15"')
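The indexing that speeds up these queries can also be requested explicitly with `data_columns`, which builds on-disk indexes for the named columns at write time. A sketch, with a hypothetical `buoy_id` column added for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative frame with an extra column to query on (names are assumptions)
df = pd.DataFrame(
    {"displacement": np.random.randn(500),
     "buoy_id": np.random.randint(0, 5, 500)},
    index=pd.date_range("2013-02-01", periods=500, freq="h"),
)

# data_columns builds on-disk indexes for the listed columns, so the
# 'where' clause can be evaluated without reading the whole table
df.to_hdf("buoy_indexed.h5", key="displacements", format="table",
          data_columns=["buoy_id"], complib="blosc", complevel=9)

subset = pd.read_hdf("buoy_indexed.h5", "displacements",
                     where="buoy_id == 3")
```

Only the matching rows are loaded into memory, which is the point of querying the file rather than reading it whole.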
