Creating spectra from displacement data

Recently I developed some code to create spectra from the ‘raw’ displacement data output by Datawell Waverider MKII/MKIII buoys. Part of the reason for doing this was to see the effect that filtering the displacement data has on the resulting spectra, and also to emulate and understand the spectra creation that takes place in Datawell’s W@ves21 software.

It was also a chance to improve my understanding of Fourier transforms and of the functionality available in the Python ecosystem to carry them out. I developed the code using SciPy and Welch’s method, which is a similar approach to the one Datawell describe in their Waverider manual: the signal is divided into a number of segments, in this case 8 segments of 256 records each, and their spectra are averaged.
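
As a minimal sketch of this approach, the snippet below applies SciPy’s implementation of Welch’s method to a heave displacement series; the input file name and the 1.28 Hz sample rate are assumptions for illustration, while the 256-record segment length matches the setup described above.

import numpy as np
from scipy import signal

# Heave displacements in metres; the file name and 1.28 Hz sample rate are assumed
heave = np.loadtxt('heave_displacements.txt')
sample_rate = 1.28

# Welch's method with 256-record segments, as in the 8 x 256 setup described above;
# noverlap can be set explicitly if non-overlapping segments are wanted
freqs, psd = signal.welch(heave, fs=sample_rate, nperseg=256)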

I made use of the IPython Notebook to develop the code; the ability to use inline plots, export to PDF, or upload as a gist and render with nbviewer was very useful. Being able to share the code with text output, plots and a narrative description in Markdown is quite powerful in my opinion, and using a gist also gives version control of the ipynb file.

This approach can read in an existing raw file or make use of the displacement DataFrames created by hebtools. It allows spectra to be averaged over longer periods, giving an average for a month or for any length of time greater than 200 seconds. Monthly plots can show spectral change over the year, and it also allows easy inter-buoy spectral comparison.
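
A rough sketch of that averaging, assuming a hebtools-style displacement DataFrame with a DatetimeIndex and a ‘heave’ column (the column name and HDF5 key are assumptions), might look like this:

import numpy as np
import pandas as pd
from scipy import signal

# Displacement DataFrame with a DatetimeIndex and a 'heave' column (assumed),
# read back from the HDF5 store described in the next section
displacements = pd.read_hdf('buoy_data', 'displacements')

# Average the per-burst spectra for one month
monthly = displacements.loc['2013-02']
spectra = []
for _, burst in monthly.groupby(pd.Grouper(freq='30min')):
    if len(burst) >= 256:
        freqs, psd = signal.welch(burst['heave'].values, fs=1.28, nperseg=256)
        spectra.append(psd)
mean_spectrum = np.mean(spectra, axis=0)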

Differences from the W@ves21 output may come from the way in which frequencies are binned; this could explain why some of the sharp peaks seen in the Datawell output are spread between two points in the output from the SciPy implementation of Welch’s method.

The rendered output can be viewed here


Using HDF5 to serialize pandas DataFrames

Pickling was the default method of saving DataFrames in versions of pandas before 0.12, but pickling has issues and doesn’t offer compression. An example month’s worth of annotated buoy displacement data is 256MB pickled.

buoy_displacements_dataframe.to_hdf('buoy_data', 'displacements', format='table', append=True, complib='blosc', complevel=9)

Using maximum compression with the blosc compression library, that comes down to 109MB. If you append another month to your existing HDF5 file you get even bigger disk space savings: an example two months’ worth of data, which would have been 512MB pickled or 204MB in separate HDF5 files, becomes 189MB in a single appended store.

Although the disk size is much smaller, when the DataFrames are read into memory they return to their original size, so memory becomes the limiting factor. Querying the HDF5 file therefore becomes important; it significantly increases the load time, although queries appear to become faster over time, I believe due to an index being built.

import pandas as pd

buoy_data_hdf = pd.HDFStore('buoy_data')

buoy_data_hdf.select("displacements", [pd.Term('index', '>', pd.Timestamp('20130215'))])
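
Relating to the point about an index being built, the index on a table can also be created explicitly with pandas’ HDFStore.create_table_index; a sketch using the store opened above, where the optlevel and kind values are simply illustrative choices:

# Build a full, heavily optimised PyTables index on the table's queryable columns
buoy_data_hdf.create_table_index('displacements', optlevel=9, kind='full')
buoy_data_hdf.close()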

SSD and eSATA

Recently I bought a Samsung 840 EVO 120GB SSD and a Delock eSATA/USB 3.0 caddy. My work laptop ( Dell Precision M6400 ) has an eSATAp socket, which looks like an ordinary USB socket but with a slightly wider indent at the top. Once I switched the BIOS mode from SATA to AHCI I was able to boot off the SSD connected through the eSATAp port. Having formatted the disk as NTFS in Windows 7, I resized the partition in Windows and then selected the unused space for the primary partition and swap space in the Ubuntu installer. Although it is slightly clumsy when moving the laptop, it is a great way to get a fast Linux experience without changing your internal disk.

Performance is noticeably better compared to the 5400rpm internal disk ( 10 seconds from GRUB to the login screen ), and Ubuntu 12.04 has better support for the USB 3.0 ExpressCard I am using. I also tested a virtual machine with its VDI located on the SSD while running Windows 7 on the 5400rpm disk as the host, and the VM ran noticeably faster. I am impressed with the stability of eSATA; I have experienced a little hanging, but I do not believe it is an issue with the eSATA connection.

It is possible to get a caddy to replace your optical drive, allowing you to have a second hard disk.

Working with large shapefiles

Part of our group was given gridded bathymetry data in the form of a very large ( 6.95GB ) point shapefile. Working with this in a GUI-based GIS tool was incredibly slow, so the task was to split the file up into smaller, more manageable files. Given the size of this dataset, I don’t think a shapefile is the best method of storing what is essentially XYZ data.

Having searched online it became clear that ogr2ogr could convert a point shapefile to a comma-separated values file. It was already installed on my work Windows laptop as part of FWTools and QGIS, so I set the process going with:

ogr2ogr -f CSV output.csv input.shp

I could see the output.csv file growing steadily in size and the system resource use seemed steady, so I left the machine for several hours. Once the command finished executing I used the Unix split command ( installed as part of Cygwin ) to split the file on line breaks into 15 files of roughly 200MB each, which can be loaded into a tool like QGIS relatively easily.

split -C 200m output.csv bathy

My colleague used a slightly different approach to get specific rectangular subsets of the shapefile and do a conversion to a different spatial reference system.

ogr2ogr -progress -f "CSV" -select "GRID_CODE" -t_srs EPSG:4326 output.csv input.shp -lco GEOMETRY=AS_XY -clipsrc 585850 6608645 625850 6648645

This transformation led to coordinates with long recurring decimals. There was no obvious way to control precision with ogr2ogr, so a short Python script making use of NumPy and pandas was written.

import numpy as np
import os
import pandas as pd

# Round every split CSV in the current directory: X and Y coordinates to 7
# decimal places and the depth column to 2, writing '<name>_round.csv' alongside
bathy_files = os.listdir('.')
for bathy_file in bathy_files:
    bathy_df = pd.read_csv(bathy_file, names=['X', 'Y', 'Z'])
    bathy_df = np.round(bathy_df, 7)
    bathy_df['Z'] = np.round(bathy_df['Z'], 2)
    bathy_df.to_csv(bathy_file[:-4] + '_round.csv', index=False)

The uncompressed rounded files totalled 1.94GB, compared with an uncompressed shapefile of almost 7GB. The compressed file size for the CSV files was 644MB versus 1GB for the compressed shapefile. The usability of the data was also improved by splitting the dataset into manageable chunks.

Sikuli and IPython

At work we have a year’s worth of time series data from 3 sensors. Most of the data is organised into 30-minute files, with a folder per month. We have some basic, unsupported, proprietary GUI-based Windows software called W@ves21. This software does some visualization and processing of the data on a per-file basis, with no way to process in batch; I believe batch processing is available as a bespoke paid service.

We came to a situation where we needed to process one month’s worth of one type of 30-minute file for each sensor, totalling in excess of 3000 files; doing this by hand would have been a very laborious task fraught with error.

When collaborating with another organisation we had used an application called Sikuli to drive a piece of proprietary software called WaveLab when setting up multiple runs. It was a successful experiment: Sikuli is based on image recognition, so it doesn’t need to understand your application in the way older software automation tools did. It was developed at MIT.

When you are writing a Sikuli script you highlight the area of the screen you want the application to carry out an event on. This needs a bit of a shift in thinking because you need to think about the application in terms of visually unique areas that the image recognition process can understand.

You can tweak the fuzziness of the matching algorithm with a sliding percentage scale, so if you have lots of similar items with very small differences you can increase the percentage to enforce a stricter match.

Sikuli is Java based, so it runs on Mac, Linux and Windows, and the scripting is done in Jython, so it uses Python syntax. The scripts can be a bit fragile, but the more detail you add the more robust they become.
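
As a small illustrative sketch (the image file names, timeout and similarity value here are made up rather than taken from the actual W@ves21 script), a Sikuli/Jython script drives an application by clicking on captured screenshots:

# Sikuli script (Jython); image names and values are hypothetical
# Wait up to 10 seconds for a button to appear, then click it
wait("open_file_button.png", 10)
click("open_file_button.png")

# Raise the similarity threshold to tell apart visually similar buttons
click(Pattern("process_button.png").similar(0.9))

# Type into the focused text field and confirm
type("2013-02-15 12-00-00" + Key.ENTER)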

You need to watch out when taking the scripts onto different machines: some applications take on the native look of their OS to some extent, and this can affect how well the original images used in the script match. If you understand the logic it shouldn’t take long to update the script accordingly.

Due to a number of issues, mainly the reliability of the software being driven, the process was not completed for a whole month in one go. After the Sikuli script finished you would have approximately 1500 files from one sensor, but some of the original files weren’t processed due to user error in loading them up. The trouble is: how do you recognise the few files missing from a sequence of over a thousand?

The file names were time stamped as part of the Sikuli process; each filename consisted of ‘sensor_name, year-month-day hour-minute-second’.

Having dealt a lot with time series files in Python, and with missing sections, I developed an approach of grabbing the filenames as a list of strings, stripping everything but the date and time, and using datetime.strptime to parse a datetime object from each string. Each datetime object is then converted to a Unix timestamp using time.mktime, and the correctly ordered array of Unix timestamps is passed to numpy.diff to get the difference between successive values. Most difference values are roughly the same, so you can use a mask to expose the large values and the numpy.where function to find their positions. Passing those positions back into the original filename array, and searching forward a few places in the index, finds the corresponding gaps; from there you can check whether the original files for those time periods exist.
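
A minimal sketch of that approach, with the folder name, filename layout and the expected 30-minute spacing assumed for illustration:

import os
import time
from datetime import datetime

import numpy as np

# Folder and filename pattern are assumptions; filenames are assumed to look
# like 'sensor_name 2013-02-15 12-30-00.txt'
filenames = sorted(os.listdir('processed_files'))
timestamps = []
for name in filenames:
    date_part = name.split(' ', 1)[1].rsplit('.', 1)[0]  # strip sensor name and extension
    parsed = datetime.strptime(date_part, '%Y-%m-%d %H-%M-%S')
    timestamps.append(time.mktime(parsed.timetuple()))

# Differences between successive timestamps; files should normally be 30 minutes
# (1800 seconds) apart, so anything much larger marks a gap in the sequence
diffs = np.diff(np.array(timestamps))
gap_positions = np.where(diffs > 1.5 * 1800)[0]
for pos in gap_positions:
    print(filenames[pos], 'is followed by a gap of', diffs[pos] / 60.0, 'minutes')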

IPython is an invaluable tool for this procedure; the read-eval-print style of iterative coding is just right for this problem, coupled with Python’s straightforward syntax and its string/datetime/time/numpy/os libraries.

Having IPython on a machine is like having a very powerful, easily scriptable terminal. It is starting to become a must-have for me on whichever box I’m working on. Anaconda Community Edition is a nice distribution of Python that comes with a lot of the useful packages.

I need to work out a way to run the IPython Notebook reliably at work. I can run it with local access, but having an instance running securely across the network would be a powerful incentive to encourage other folks to engage with Python.

64 bit computing, Python, Numpy and Pandas

I have been using 64-bit Windows 7 on my work laptop. It seems a robust system, better than Vista or the buggy XP 64. As my work machine, a Dell Precision M90, is about 5 years old, there isn’t much of a performance improvement, since the motherboard chipset limits usable memory to 3.25GB.

I have been doing a lot of work on time series data, particularly iterating over half-hourly files and processing them, and using pandas for this has been invaluable. If I batch process several months’ worth of files ( the data is structured into one folder per month ), I end up hitting memory problems once the commit size of the process nears 2GB on 32-bit Python. We have some more powerful workstations at work ( one with 32GB of RAM ); I ran the process again there and hit the same problem, so I realized it must be a cap in 32-bit Python. I installed 64-bit Python and the unofficial Windows 64-bit binaries of the libraries I needed, and had no more memory issues.

64 bit computing is a revelation when using a proper 64 bit stack and decent hardware. It is invaluable when working with large data sets.

Quadro FX 1500m on Windows 7

Nvidia has a tendency to create cards that aren’t directly supported by their own drivers; an example of this is the Quadro FX 1500M in the Dell Precision M90 ( 2007 era ), which I believe was also available in some high-end HP laptops. These cards were high performance but a bit buggy; I have seen two Quadro FX 2500M cards in different Precision M90s fail, I believe because of heat issues. Previously I have only been able to use Dell drivers on the M90, but Dell doesn’t offer any Windows 7 drivers for these cards.

The GPU used in these cards is the G71GLM, which is also used in the GeForce Go 7900 GS, and there is a set of beta Windows 7 drivers available for that GeForce. Once the initial executable ( 197.48 ) extracts its files you can edit ( default path ) C:\NVIDIA\WinVista64\IS\Display\nvdm.inf on line 136 from

%NVIDIA_G71.DEV_0298.1%   = nv_G7x,        PCI\VEN_10DE&DEV_0298&SUBSYS_019B1028
to ( taken from the Hardware ID for the FX 1500M graphics adapter in Device Manager )
%NVIDIA_G71.DEV_0298.1%   = nv_G7x,        PCI\VEN_10DE&DEV_029B&SUBSYS_019B1028
I presume this also works with the FX 2500M and 3500M.