The Fastest Way to Extract Values from Geospatial Data

A considerable part of what we do at Terramonitor can be summed up by the extract, transform, load (ETL) paradigm. This happens both in internal processes and in those that are externally triggered.

  • Internally, when we acquire raw remote sensing data from a satellite data provider and store preprocessed data in our own database.
  • Externally, when a client requests any part of that data via our user interface, data API or their own GIS client.

The internal process is offline and not very time critical. The focus is on data quality, and optimization is generally on the order of reducing the time a process takes from 12 hours to 2. This saves computation costs, but any process taking hours is still best described as "leave it running and get back to it". The external process is online and very time critical. It has to happen faster than a user can change the active tab in their browser. Here, it makes a big difference whether the response arrives in 12 seconds or 2.

Cover Photo by Markus Spiske on Unsplash

The Three Iterations of Land Cover API

Our Land Cover API allows the user to query information for a disc-shaped area around any point on the globe, and receive a textual answer consisting of remote sensing data related to that area. The data to fetch is stored as GeoTIFFs in cloud storage. The following three iterations are what we went through to get where we are, from slowest to fastest.

I – naïve implementation

Fully functional and very quick to implement, our first implementation was roughly as follows:

  1. download GeoTIFF from cloud storage
  2. run Python process and perform geospatial query with GDAL on GeoTIFF
  3. output pixel values
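
The geospatial query in step 2 boils down to picking the pixels whose centers lie inside the requested disc. The following is an illustrative sketch in plain Python rather than the actual implementation. It assumes a north-up raster described by a GDAL-style six-element geotransform, coordinates in a projected system where Euclidean distance is meaningful, and pixel values already read into a list of rows. The function name pixels_in_disc is hypothetical.

```python
import math


def pixels_in_disc(geotransform, center_x, center_y, radius, width, height):
    """Return (row, col) pairs of pixels whose centers lie inside the disc.

    Assumes a north-up raster: geotransform is
    (origin_x, pixel_width, 0, origin_y, 0, pixel_height) with a negative
    pixel_height, as GDAL reports it.
    """
    origin_x, pixel_w, _, origin_y, _, pixel_h = geotransform

    # Bounding box of the disc in pixel coordinates, clipped to the raster.
    col_min = max(0, math.floor((center_x - radius - origin_x) / pixel_w))
    col_max = min(width - 1, math.floor((center_x + radius - origin_x) / pixel_w))
    row_min = max(0, math.floor((center_y + radius - origin_y) / pixel_h))
    row_max = min(height - 1, math.floor((center_y - radius - origin_y) / pixel_h))

    hits = []
    for row in range(row_min, row_max + 1):
        for col in range(col_min, col_max + 1):
            # World coordinates of the pixel center.
            x = origin_x + (col + 0.5) * pixel_w
            y = origin_y + (row + 0.5) * pixel_h
            if (x - center_x) ** 2 + (y - center_y) ** 2 <= radius ** 2:
                hits.append((row, col))
    return hits


# Hypothetical 10 x 10 raster with 10 m pixels; each value encodes row * 10 + col.
grid = [[row * 10 + col for col in range(10)] for row in range(10)]
hits = pixels_in_disc((0, 10, 0, 100, 0, -10), 50, 50, 10, 10, 10)
values = [grid[row][col] for row, col in hits]
```

In implementation I the grid would come from a GeoTIFF that was downloaded in full, even though only a handful of its pixels end up in the answer.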

II – cloud optimized GeoTIFF

There was a lot of unnecessary data being transferred in the first implementation, as generally only a very small portion of a GeoTIFF is used per query. So the first improvement was optimizing data transfer.

  1. download only requested part of GeoTIFF from cloud storage using GDAL and cloud optimized GeoTIFF
  2. run Python process and perform geospatial query with GDAL on GeoTIFF
  3. output pixel values
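
A windowed read of this kind can be sketched as follows. This is an assumption-laden illustration, not the code we run: it assumes the GDAL Python bindings (osgeo) are installed, a north-up raster, a single band, and a file reachable over HTTP through GDAL's /vsicurl/ virtual file system. The helper names disc_window and read_disc_window are hypothetical.

```python
import math


def disc_window(geotransform, center_x, center_y, radius, width, height):
    """Return (xoff, yoff, xsize, ysize) of the pixel window covering the disc."""
    origin_x, pixel_w, _, origin_y, _, pixel_h = geotransform
    col_min = max(0, math.floor((center_x - radius - origin_x) / pixel_w))
    col_max = min(width - 1, math.floor((center_x + radius - origin_x) / pixel_w))
    row_min = max(0, math.floor((center_y + radius - origin_y) / pixel_h))
    row_max = min(height - 1, math.floor((center_y - radius - origin_y) / pixel_h))
    if col_min > col_max or row_min > row_max:
        raise ValueError("disc does not overlap the raster")
    return col_min, row_min, col_max - col_min + 1, row_max - row_min + 1


def read_disc_window(url, center_x, center_y, radius):
    """Read only the window around the disc from a remote GeoTIFF."""
    # Imported lazily so the window arithmetic above works without GDAL.
    from osgeo import gdal

    dataset = gdal.Open("/vsicurl/" + url)
    if dataset is None:
        raise RuntimeError("could not open " + url)
    window = disc_window(
        dataset.GetGeoTransform(),
        center_x, center_y, radius,
        dataset.RasterXSize, dataset.RasterYSize,
    )
    # GDAL fetches only the byte ranges needed for this window.
    return dataset.GetRasterBand(1).ReadAsArray(*window)
```

For a 10 x 10 raster with 10 m pixels, disc_window((0, 10, 0, 100, 0, -10), 50, 50, 12, 10, 10) yields a 4 x 4 window, so 16 pixels are requested instead of 100. With a cloud optimized GeoTIFF the saving grows with the size of the file, because the internal tiling lets GDAL skip everything outside the window.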

III – hex dump (current)

The second implementation was fairly fast, but now the bottleneck was the Python process performing a geospatial query, which isn't even necessary as the downloaded data already contains only the requested information. And you don't need Python if you are just going to output the contents of a file. So we got rid of the Python process completely.

  1. download only requested part of GeoTIFF from cloud storage using GDAL and cloud optimized GeoTIFF
  2. transform GeoTIFF into a PNM image
  3. run od (hex dump file contents) to output pixel values

A visualization of the current Terramonitor Land Cover API technical implementation for query handling
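
Steps 1 and 2 above can be combined into a single GDAL call. The sketch below builds a gdal_translate command line that reads a window from a remote file and writes it as PNM. The exact command we use is not shown in this post, so treat the choice of gdal_translate, the -projwin window, the URL, and the function names as assumptions made for illustration.

```python
import subprocess


def build_translate_command(url, center_x, center_y, radius, out_path):
    """Build a gdal_translate call that extracts the disc's bounding box as PNM.

    Coordinates are assumed to be in the raster's own coordinate system.
    """
    # -projwin expects the upper-left corner first, then the lower-right.
    ulx, uly = center_x - radius, center_y + radius
    lrx, lry = center_x + radius, center_y - radius
    return [
        "gdal_translate", "-q",
        "-of", "PNM",
        "-projwin", str(ulx), str(uly), str(lrx), str(lry),
        "/vsicurl/" + url,  # read remotely, fetching only the needed ranges
        out_path,
    ]


def extract_window(url, center_x, center_y, radius, out_path):
    """Run the command; requires the GDAL command line tools to be installed."""
    subprocess.run(
        build_translate_command(url, center_x, center_y, radius, out_path),
        check=True,
    )


# Hypothetical URL and coordinates, used only to show the resulting command.
command = build_translate_command(
    "https://example.com/landcover.tif", 500.0, 1000.0, 50.0, "out.pgm"
)
```

Note that the window is the bounding box of the disc, so the corner pixels outside the disc are part of the output as well.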

The earth observation company Planet released a blog post about reading a single pixel from a GeoTIFF file without using GDAL. In other words, building your own GeoTIFF reader. This requires a deep understanding of the cloud optimized GeoTIFF format, including the compression algorithms it may use.

We are happy to use gdallocationinfo for single pixel queries, and would happily use a similar GDAL binary for multiple pixels if one existed. Our current solution is not to read the contents of the GeoTIFF file at all, but to transform it to a considerably simpler file format, and read that instead. In practice, this implementation is orders of magnitude faster than implementations I and II.
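
To show how simple the target format is, here is a sketch of a complete reader for a single-band binary PNM file (a PGM, magic number P5), written in Python purely for illustration. It assumes the Netpbm layout of a short text header followed by uncompressed samples, one byte each when the maximum value is below 256 and two big-endian bytes otherwise. The function name parse_pgm is hypothetical.

```python
import struct


def parse_pgm(data):
    """Parse a binary PGM (magic number P5) held in a bytes object.

    Returns (width, height, maxval, header_length, pixels).
    """
    tokens = []
    pos = 0
    # The header is four whitespace-separated tokens: magic, width, height, maxval.
    while len(tokens) < 4:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            # A comment runs to the end of the line.
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])

    if tokens[0] != b"P5":
        raise ValueError("not a binary PGM file")
    width, height, maxval = (int(token) for token in tokens[1:])

    # Exactly one whitespace byte separates the header from the raster.
    header_length = pos + 1
    count = width * height
    if maxval < 256:
        pixels = list(data[header_length:header_length + count])
    else:
        raw = data[header_length:header_length + 2 * count]
        pixels = list(struct.unpack(">%dH" % count, raw))
    return width, height, maxval, header_length, pixels


# A hand-made 2 x 2 image: an 11-byte header followed by four samples.
sample = b"P5 2 2 255\n" + bytes([1, 2, 3, 4])
width, height, maxval, header_length, pixels = parse_pgm(sample)
```

Because nothing is compressed or tiled, everything after header_length is the pixel values in row order. That is why a generic dump tool such as od is enough: skip the header bytes (od offers the -j option for this) and print the rest as unsigned integers.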

Processing Geospatial Data

If you are working with a large number of GeoTIFF files and are running into performance or hosting issues, please contact us for details on our implementation.

Contact us via a form or send an e-mail to sales@terramonitor.com