Spatial data in Python

Jeff Allen, Aniket Kali

📥 Click here to download this document and any associated data and images

There are a variety of libraries, methods, and functions for working with spatial data in Python.

GeoPandas is a library for working with spatial (typically geographic) data. It is used a lot since it extends the functionality of pandas to support spatial data types and operations, making it easier to analyze, visualize, and manipulate spatial data.

Many of the tasks that are typically done within a GIS can be done via geopandas. And often, it is preferable to work in geopandas versus point-and-click GIS (e.g. QGIS) since it’s easier to chain together and automate processes, have access to the full capabilities of pandas and other Python libraries, as well as can be better scaled for handling very large datasets.

This tutorial provides an introduction to geopandas, its geometry data types, how to quickly plot spatial data, and methods for spatial calculations (e.g. computing areas, bounding boxes, etc.). The last section of this tutorial provides direction for next steps, including libraries and examples for doing advanced spatial data processing, analyses, and visualizations.

This tutorial assumes you have base knowledge of working with spatial data (e.g. in GIS) and working with dataframes in pandas. Check out the following if you want a refresher:

We’ll begin this tutorial by loading in pandas, geopandas, and matplotlib (the latter for showing how to quickly plot and view data). If you don’t have these installed already, you’ll have to install them via pip or conda.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

Loading and exploring geometric data

Geospatial data represents real-world features using three primary geometric types:

Points: Single (x,y) coordinates for discrete locations like transit stops or landmarks.
Lines: Connected sequences of points forming paths, such as roads or rivers.
Polygons: Closed shapes defining areas like census tracts or property boundaries.

transit_stops = gpd.read_file("data/ttc_stops.geojson")
transit_routes = gpd.read_file("data/ttc_routes.geojson")
wards = gpd.read_file("data/city-wards.geojson")

In geopandas, we typically load data with a more agnostic read_file() function. For this workshop, we’re going to use four sources of data:

transit_stops: each of the stops for the TTC
transit_routes: each of the lines for the TTC
wards: polygons of City of Toronto wards (i.e. council districts)

Let’s take a look at the first layer, the transit stops. We have two columns with text, and a third with geometry data.

Here the data are coded as a MULTILINESTRING, essentially a combination of lines that combine into one object.

transit_routes

	STATUS	TECHNOLOGY	NAME	geometry
0	Existing	Subway	Line 4: Sheppard Subway	MULTILINESTRING ((-79.41092 43.76152, -79.4096...
1	Existing	Subway	Line 1: Yonge-University Subway	MULTILINESTRING ((-79.52727 43.79351, -79.5261...
2	Existing	Subway	Line 2: Bloor-Danforth Subway	MULTILINESTRING ((-79.26453 43.73227, -79.2669...
3	In Delivery	LRT / BRT	Eglinton Crosstown LRT	MULTILINESTRING ((-79.26453 43.73227, -79.2679...
4	In Delivery	LRT / BRT	Eglinton Crosstown West Extension	MULTILINESTRING ((-79.48726 43.68739, -79.4901...
5	In Delivery	Subway	Scarborough Subway Extension	MULTILINESTRING ((-79.26453 43.73227, -79.2630...
6	In Delivery	LRT / BRT	Finch West LRT	MULTILINESTRING ((-79.49099 43.76349, -79.4922...
7	In Delivery	Subway	Ontario Line	MULTILINESTRING ((-79.35168 43.69644, -79.3414...
8	In Delivery	Subway	Yonge North Subway Extension	MULTILINESTRING ((-79.41557 43.77977, -79.4157...

The data can be manipulated like a regular pandas data frame. For example, if we want to filter out routes that currently do not operate, like the incomplete “Eglinton Crosstown LRT” and defunct “Scarborough RT”, we can do so by filtering the STATUS column.

transit_routes = transit_routes.loc[transit_routes["STATUS"] == "Existing"]
transit_routes

	STATUS	TECHNOLOGY	NAME	geometry
0	Existing	Subway	Line 4: Sheppard Subway	MULTILINESTRING ((-79.41092 43.76152, -79.4096...
1	Existing	Subway	Line 1: Yonge-University Subway	MULTILINESTRING ((-79.52727 43.79351, -79.5261...
2	Existing	Subway	Line 2: Bloor-Danforth Subway	MULTILINESTRING ((-79.26453 43.73227, -79.2669...

Let’s do the same for the stops.

The geometry for the stops are simpler, just POINT.

transit_stops = transit_stops.loc[transit_stops["STATUS"] == "Existing"]
transit_stops

	LOCATION_N	STATUS	TECHNOLOGY	NAME	geometry
0	Kipling	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.53628 43.63694)
1	Islington	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.5246 43.64533)
2	Royal York	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.51129 43.64812)
3	Old Mill	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.49509 43.65008)
4	Jane	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.48446 43.6498)
...	...	...	...	...	...
69	Vaughan Metropolitan Centre	Existing	Subway	Line 1: Yonge-University Subway	POINT (-79.52727 43.79351)
70	Sheppard-Yonge	Existing	Subway	Line 4: Sheppard Subway	POINT (-79.41092 43.76152)
71	Spadina	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.40397 43.66729)
72	St. George	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.39931 43.66828)
73	Bloor-Yonge	Existing	Subway	Line 2: Bloor-Danforth Subway	POINT (-79.38572 43.671)

74 rows × 5 columns

Before we go on, it’s important to have an idea of the metadata of geometric files that we work with. There’s two key parts to this.

CRS (Coordinate Reference System): The crs attribute defines a geodataset’s spatial “coordinate system” (e.g., latitude/longitude, meters-based projections). We can use it to ensure layers align—for example, combining Toronto census tracts (EPSG:3347) with a Web Mercator basemap (EPSG:3857).
Total Bounds: The total_bounds attribute returns the min/max coordinates (xmin, ymin, xmax, ymax) of your data’s extent. It’s useful for setting map zoom levels or clipping other datasets to the same area—like focusing a transit map on Toronto’s downtown core.

transit_stops.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

transit_stops.total_bounds

array([-79.53628,  43.63694, -79.26453,  43.79351])

Plotting geographic data

We explore geometry simply by plotting using .plot(). We can do this for any row, or the entire GeoDataFrame

transit_stops.plot()

This is the default plot, but we can tweak the colours, add multiple layers, and change some of the layout options using matplotlib, a commonly used Python plotting library. Here’s a very simple schematic of rapid transit in Toronto.

fig, ax = plt.subplots(ncols = 1, figsize=(4, 4))

wards.plot(
    linewidth = 1,
    color="LightGray",
    edgecolor="White",
    ax = ax
)

transit_stops.plot(
    color="Black",
    markersize = 6,
    ax = ax
)

transit_routes.plot(
    linewidth = 1,
    color="Black",
    ax = ax
).set_axis_off()

Interactive Exploration

GeoPandas’ explore() function generates an interactive Leaflet map (like Google Maps) from your geodata. We can use it to better understand the data we are working with and how it might be viewed from the user side on a web application (e.g., Svelte).

You’ll need to install a couple libraries in order for this to work - matplotlib, folium, and mapclassify. This can be done in the environment that you’re working in with a command like pip install folium matplotlib mapclassify.

transit_stops.explore(
    column='NAME',
    tiles="CartoDB Positron", 
    marker_kwds={"radius": 7}
)

Make this Notebook Trusted to load map: File -> Trust Notebook

Geometery properties

geopandas has a number of functions you can run on geometry columns that can be super useful for analyzing and summarizing data. Here’s a list of a few examples of using these to create a new column of data (more can be found in the official documentation)

Property	Description	Example
`.area`	Area of the geometry (in CRS units)	`gdf["area"] = gdf.geometry.area`
`.length`	Line length (perimeter length for polygons)	`gdf["length"] = gdf.geometry.length`
`.is_valid`	Geometry validity check	`gdf["valid"] = gdf.geometry.is_valid`
`.is_empty`	Whether geometry is empty	`gdf["empty"] = gdf.geometry.is_empty`
`.is_simple`	No self-intersections	`gdf["simple"] = gdf.geometry.is_simple`
`.type`	Type of geometry (e.g., Polygon)	`gdf["geom_type"] = gdf.geometry.type`
`.bounds`	Bounding box as (minx, miny, maxx, maxy)	`gdf["bounds"] = gdf.geometry.bounds`

Processing and analyzing spatial data

This tutorial provided a quick overview of how we can work with spatial data in geopandas. However, this just scratched the surface of what is possible

If you’re interested the why and how of spatial data processing, i.e. about converting spatial data from one format to another (e.g. generating centroids of polygons, buffers around points, joining and linking multiple spatial datasets to each other, etc.), check out our Processing spatial data tutorial.

Beyond that, you can check out the following Python libraries for working with spatial data:

Shapely (wide variety of tools for manipulation of 2d geometry data, some of which geopandas uses.)
Pysal (spatial data analysis library, including various descriptive statistics and spatial modelling functions)
Rasterio (library for working with raster data)
Folium (more interactive maps!)