Working with large shapefiles

Part of our group was given gridded bathymetry data in the form of a very large (6.95GB) point shapefile. Working with it in a GUI-based GIS tool was incredibly slow, so the task was to split the file into smaller, more manageable files. Given the size of the dataset, I don't think a shapefile is the best way to store what is essentially XYZ data.

After searching online, it became clear that ogr2ogr could convert a point shapefile to a comma-separated values (CSV) file. ogr2ogr was already installed on my work Windows laptop as part of FWTools and QGIS, so I set the process going with:

ogr2ogr -f CSV output.csv input.shp

I could see output.csv growing steadily and system resource use seemed stable, so I left the machine running for several hours. Once the command had finished, I used the Unix split command (installed as part of Cygwin) to split the output at line breaks into 15 files of up to 200MB each, which can be loaded into a tool like QGIS relatively easily.

split -C 200m output.csv bathy
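Where Cygwin is not available, the same line-preserving split can be sketched in Python. This is only a minimal sketch, not what I actually ran: the function name split_csv and the numbered output names are my own invention, and it assumes the CSV starts with a header row, which it copies to the top of every part (something split does not do).

```python
def split_csv(path, prefix, max_bytes=200 * 1024 * 1024):
    """Split a CSV into parts of roughly max_bytes without breaking a line.

    The first line is treated as a header and repeated in every part.
    Returns the list of part file names that were written.
    """
    parts = []
    out = None
    written = 0
    with open(path, 'rb') as src:
        header = src.readline()
        for line in src:
            # Start a new part when this line would push us past the limit
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                name = '%s%03d.csv' % (prefix, len(parts))
                out = open(name, 'wb')
                out.write(header)
                written = len(header)
                parts.append(name)
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Calling split_csv('output.csv', 'bathy') would write bathy000.csv, bathy001.csv and so on. Because every part carries the header, each one can be opened on its own.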

My colleague used a slightly different approach to extract specific rectangular subsets of the shapefile and convert them to a different spatial reference system:

ogr2ogr -progress -f "CSV" -select "GRID_CODE" -t_srs EPSG:4326 output.csv input.shp -lco GEOMETRY=AS_XY -clipsrc 585850 6608645 625850 6648645
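The four numbers after -clipsrc are xmin, ymin, xmax and ymax, given in the coordinates of the source data. For data that has already been converted to CSV, a similar rectangular subset could be taken with pandas. The following is only a sketch under assumptions: the column names X, Y and Z, the helper clip_to_bbox and the sample points are all hypothetical, and no reprojection is done.

```python
import pandas as pd

def clip_to_bbox(df, xmin, ymin, xmax, ymax):
    """Return the rows whose X/Y fall inside the box (edges included)."""
    inside = df['X'].between(xmin, xmax) & df['Y'].between(ymin, ymax)
    return df[inside]

# Hypothetical sample points in the same coordinate system as the box
points = pd.DataFrame({
    'X': [590000.0, 700000.0, 625850.0],
    'Y': [6610000.0, 6610000.0, 6648645.0],
    'Z': [-10.5, -20.0, -30.0],
})

# Same box as in the ogr2ogr command above
subset = clip_to_bbox(points, 585850, 6608645, 625850, 6648645)
```

Unlike ogr2ogr, this has to read the whole CSV into memory first, so it only suits files that have already been split.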

This transformation produced coordinates with long decimal fractions. There was no obvious way to set the output precision with ogr2ogr, so a short Python script using NumPy and pandas was written.

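import os

import numpy as np
import pandas as pd

# Process only the CSV files, skipping this script and earlier '_round' output
bathy_files = [f for f in os.listdir('.')
               if f.endswith('.csv') and not f.endswith('_round.csv')]
for bathy_file in bathy_files:
    # Assumes no header row; add header=0 if the file starts with one
    bathy_df = pd.read_csv(bathy_file, names=['X', 'Y', 'Z'])
    # Round coordinates to 7 decimal places, then depths to 2
    bathy_df = np.round(bathy_df, 7)
    bathy_df['Z'] = np.round(bathy_df['Z'], 2)
    bathy_df.to_csv(bathy_file[:-4] + '_round.csv', index=False)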
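If a single CSV is too large to load in one go, the same rounding could be done a block of rows at a time. This is a sketch rather than the script we used: the function name round_csv_in_chunks and the chunk size are my own choices, and like the script above it assumes three columns with no header row.

```python
import pandas as pd

def round_csv_in_chunks(src, dst, chunksize=1000000):
    """Round X/Y to 7 decimal places and Z to 2, one chunk at a time."""
    reader = pd.read_csv(src, names=['X', 'Y', 'Z'], chunksize=chunksize)
    for i, chunk in enumerate(reader):
        rounded = chunk.round({'X': 7, 'Y': 7, 'Z': 2})
        # Write the header with the first chunk, then append the rest
        rounded.to_csv(dst, mode='w' if i == 0 else 'a',
                       header=(i == 0), index=False)
```

Only one chunk is held in memory at a time, so memory use depends on chunksize rather than on the size of the file.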

The uncompressed rounded files totalled 1.94GB, compared with almost 7GB for the uncompressed shapefile. Compressed, the CSV files came to 644MB versus 1GB for the shapefile. Splitting the dataset into manageable chunks also made the data easier to use.