Processing and analyzing data

Aniket Kali, Jeff Allen

This notebook provides an introduction to analyzing urban data. It will cover …

To do this, we’ll primarily be using pandas, a Python library for analyzing and organizing tabular data. For each section, we’ll also link to relevant documentation for doing similar tasks in Microsoft Excel and Google Sheets.

If you haven’t installed pandas yet, you can install it by “un-commenting” the line of code below (i.e., removing the # and following space at the beginning of the line).

# !pip install pandas

Next, let’s import the pandas library using the pd alias. An alias in Python is an alternate name for a library that’s typically shorter and easier to reference in the code later on.

import pandas as pd

pandas is the most common library for working with both big and small datasets in Python, and it is also the basis for working with more analytical packages (e.g. numpy, scipy, scikit-learn) and analyzing geographic data (e.g. geopandas).

Tables & DataFrames

A very common way of structuring and analyzing a dataset is in a 2-dimensional table format, where rows are individual records or observations, while columns are attributes (often called variables) that describe those observations. For example, each row could represent a city and the columns could indicate the population (column 1 = population) for different time periods (column 2 = year). Data stored in spreadsheets often take this format.

In pandas, a DataFrame is a tabular data structure similar to a spreadsheet, where data is organized in rows and columns. Columns can contain different kinds of data, such as numbers, strings, dates, and so on. When we load data in pandas, we typically load it into the structure of a DataFrame.

Let’s first take a look at a small dataset, which includes Census data about Canadian municipalities and their populations in 2021 and 2016. In Statistics Canada lingo, these are called Census Subdivisions. This dataset only includes municipalities with a population greater than 25,000 in 2021.

The main method for loading CSV data is the read_csv function, but pandas can also read and write (export) many other data formats.

df = pd.read_csv("data/cities.csv")

Great! Now our data is stored in the variable df and has the structure of a DataFrame.

Viewing data

In spreadsheet software, when we open a data file, we will see the top rows of the data right away.

In pandas, we can simply type the name of the DataFrame, in this case df, in a cell to view it. By default, it will print the top and bottom rows of the DataFrame.

df
Name Prov/terr Population, 2021 Population, 2016
0 Abbotsford B.C. 153524.0 141397.0
1 Airdrie Alta. 74100.0 61581.0
2 Ajax Ont. 126666.0 119677.0
3 Alma Que. 30331.0 30771.0
4 Aurora Ont. 62057.0 55445.0
... ... ... ... ...
174 Windsor Ont. 229660.0 217188.0
175 Winnipeg Man. 749607.0 705244.0
176 Wood Buffalo Alta. 72326.0 71594.0
177 Woodstock Ont. 46705.0 41098.0
178 Woolwich Ont. 26999.0 25006.0

179 rows × 4 columns

We can specify which rows we want to view.

Let’s explore what this data frame looks like. Calling the method .head(N) or .tail(N) prints the top or bottom N rows of the DataFrame. The following prints the first 10 rows.

Try to edit this to print the bottom 10 rows or a different number of rows.

df.head(10)
Name Prov/terr Population, 2021 Population, 2016
0 Abbotsford B.C. 153524.0 141397.0
1 Airdrie Alta. 74100.0 61581.0
2 Ajax Ont. 126666.0 119677.0
3 Alma Que. 30331.0 30771.0
4 Aurora Ont. 62057.0 55445.0
5 Barrie Ont. 147829.0 141434.0
6 Belleville Ont. 55071.0 50716.0
7 Blainville Que. 59819.0 NaN
8 Boisbriand Que. 28308.0 26884.0
9 Boucherville Que. 41743.0 41671.0

Notice that each column has a unique name. We can view the data of this column alone by using that name, and see what unique values exist using .unique().

Try viewing the data of another column. Beware of upper and lower case – exact names matter!

df['Prov/terr'].head(10)  # Top 10 only
0     B.C.
1    Alta.
2     Ont.
3     Que.
4     Ont.
5     Ont.
6     Ont.
7     Que.
8     Que.
9     Que.
Name: Prov/terr, dtype: object
df['Prov/terr'].unique() # Unique values for the *full* dataset
array(['B.C.', 'Alta.', 'Ont.', 'Que.', 'Man.', nan, 'N.S.', 'P.E.I.',
       'N.L.', 'N.B.', 'Sask.', 'Y.T.'], dtype=object)

What happens if you do df['Prov/terr'].head(10).unique() instead?

Filtering data

We often want to look at only a portion of our data that fit some set of conditions (e.g. all cities in a province that have a population more than 100,000). This is often called filtering, querying, or subsetting a dataset.

Let’s try to do some filtering in pandas with our data of Canadian cities.

Filtering and sorting can also be done in spreadsheet software; both Excel and Google Sheets have built-in filter and sort tools.

We can use the columns to identify data that we might want to filter by. The line below shows data only for Ontario, but see if you can filter for another province or territory.

df.loc[df['Prov/terr'] == 'Ont.']
Name Prov/terr Population, 2021 Population, 2016
2 Ajax Ont. 126666.0 119677.0
4 Aurora Ont. 62057.0 55445.0
5 Barrie Ont. 147829.0 141434.0
6 Belleville Ont. 55071.0 50716.0
10 Bradford West Gwillimbury Ont. 42880.0 35325.0
... ... ... ... ...
171 Whitby Ont. 138501.0 128377.0
172 Whitchurch-Stouffville Ont. 49864.0 45837.0
174 Windsor Ont. 229660.0 217188.0
177 Woodstock Ont. 46705.0 41098.0
178 Woolwich Ont. 26999.0 25006.0

67 rows × 4 columns

pandas also lets us filter with other comparison operators, not just equality. Previously, we asked for all rows in Ontario.

Now, filter for all cities which had a population of at least 100,000 in 2021.

HINT: in Python, “greater than or equals to” (i.e., “at least”) is represented using the syntax >=.
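For example, a numeric comparison filter follows the same pattern as the equality filter above. Here is a minimal sketch, with a different column, operator, and threshold than the exercise, so you can adapt it:

# Cities with fewer than 30,000 residents in 2016
df.loc[df['Population, 2016'] < 30000]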

Pandas also allows us to combine filtering conditions.

Use the template below to select all cities in Ontario with a population of over 100,000 in 2021. A worked example with a different pair of conditions follows the template.

df.loc[(df["Prov/terr"] == "Ont.") & (YOUR CONDITION HERE)]
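To illustrate the syntax (using a different condition than the exercise), each condition sits in its own parentheses and is joined with & for “and”; | works the same way for “or”:

# Cities in Quebec with a 2016 population below 30,000
df.loc[(df['Prov/terr'] == 'Que.') & (df['Population, 2016'] < 30000)]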

Now let’s count how many cities actually meet these conditions. Run the line below to see how many cities there are in this data set in Ontario.

df.loc[df['Prov/terr'] == 'Ont.'].count()
Name                66
Prov/terr           67
Population, 2021    66
Population, 2016    65
dtype: int64

The function .count() returns the number of non-missing values in each column - but if we wanted to count just one column, we could first select that individual column using df[COL_NAME].

Try a different condition and count the amount of data for it.
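For instance, a single-column count might look like this (a sketch; any condition works):

# Count the non-missing names among British Columbia cities only
df.loc[df['Prov/terr'] == 'B.C.', 'Name'].count()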

Sorting data

You might have noticed that these cities are in alphabetical order - what if we wanted to see them in the order of population? In pandas, we do this using the sort_values function. The default is to sort in ascending order, so we set this to be False (i.e. descending) so the most populous cities are at the top.

Check out these links for sorting data in Excel and Google Sheets:
- Sorting in Excel
- Sorting in Google Sheets

df.sort_values(by='Population, 2021', ascending=False)
Name Prov/terr Population, 2021 Population, 2016
158 Toronto Ont. 2794356.0 2731571.0
89 Montréal Que. 1762949.0 1704694.0
19 Calgary Alta. 1306784.0 1239220.0
106 Ottawa Ont. 1017449.0 934243.0
42 Edmonton Alta. 1010899.0 933088.0
... ... ... ... ...
115 Prince Edward County Ont. 25704.0 24735.0
78 NaN N.S. 25545.0 24863.0
40 Drummondville Que. NaN 75423.0
109 Peterborough Ont. NaN NaN
169 West Kelowna B.C. NaN NaN

179 rows × 4 columns

Let’s put some of this together now.

Filter the data to show all cities which are in the province of Quebec with at least a population of 50,000 in 2016, and then try to sort the cities by their 2016 population.

HINT: You can do this in two steps (which is more readable) by storing the data that you filter into a variable called df_filtered, then running the command to sort the values on df_filtered.
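One possible two-step solution is sketched below; note that it creates the df_filtered variable used in the export example in the next section.

# Step 1: filter for Quebec cities with at least 50,000 residents in 2016
df_filtered = df.loc[(df['Prov/terr'] == 'Que.') & (df['Population, 2016'] >= 50000)]

# Step 2: sort the filtered rows by 2016 population, largest first
df_filtered = df_filtered.sort_values(by='Population, 2016', ascending=False)
df_filtered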

Exporting data

Once we have processed and analyzed our data, we may want to save it for future use or share it with others. pandas makes it easy to export data to various formats, such as CSV or Excel files.

Below, we demonstrate how to export a DataFrame to both CSV and Excel formats. This is particularly useful for sharing results or viewing the data in other tools like spreadsheet software. (Note that writing .xlsx files requires an engine such as openpyxl to be installed.)

# Save to CSV
df_filtered.to_csv('df_filtered.csv', index=False)

# Save to Excel
df_filtered.to_excel('df_filtered.xlsx', index=False)

Updating and renaming columns

Often, the data we have might not be in the condition we want it to be in. Some data might be missing, and other data might have odd naming conventions. In Excel or Google Sheets, you can simply change a column name by typing the edits you want in the appropriate cell. However, Python requires different steps.

Here’s a simple example. We might want all city names to be in lowercase - the code below does this for us.

df['Name'] = df['Name'].str.lower()
df
Name Prov/terr Population, 2021 Population, 2016
0 abbotsford B.C. 153524.0 141397.0
1 airdrie Alta. 74100.0 61581.0
2 ajax Ont. 126666.0 119677.0
3 alma Que. 30331.0 30771.0
4 aurora Ont. 62057.0 55445.0
... ... ... ... ...
174 windsor Ont. 229660.0 217188.0
175 winnipeg Man. 749607.0 705244.0
176 wood buffalo Alta. 72326.0 71594.0
177 woodstock Ont. 46705.0 41098.0
178 woolwich Ont. 26999.0 25006.0

179 rows × 4 columns

DataFrames are easily modifiable. pandas has a number of methods like str.lower() to do so (see the full documentation). The important thing to note here is that we directly modified the existing values of a column. We might not always want to do this, but it is often a good way of saving computer memory by not duplicating data.

Likewise, we might want better names for the columns we have. Take a look at the documentation for .rename() and modify the data frame df such that we rename the column Name to City. Pandas has more methods than we could ever remember, so learning to navigate the documentation is a crucial part of using the library.
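As a pointer, .rename() takes a dictionary mapping old names to new ones. The sketch below only previews the result rather than overwriting df, since later cells in this notebook still expect a column called Name:

# Preview the rename; assign the result back (df = df.rename(...)) to keep it
df.rename(columns={'Name': 'City'}).head(3)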

Handling missing data

Unfortunately, it is pretty common that the datasets we work with will be missing data values for some rows and columns. This can create complications when we want to produce summary statistics or visualizations. There are different strategies for dealing with this (i.e., imputing data), but the easiest is simply to remove missing values. But first let’s check how much data is missing.

df.isnull().sum()
Name                3
Prov/terr           4
Population, 2021    3
Population, 2016    4
dtype: int64

It seems that each column has a couple of data points missing. Let’s take a look at which rows these occur in. Similar to how we created a condition to filter for certain data, the code below creates a condition to filter for rows with missing data.

df.loc[df.isnull().any(axis=1)]
Name Prov/terr Population, 2021 Population, 2016
7 blainville Que. 59819.0 NaN
18 caledon NaN 76581.0 66502.0
30 NaN Ont. 101427.0 92013.0
40 drummondville Que. NaN 75423.0
51 grimsby Ont. 28883.0 NaN
64 la prairie NaN 26406.0 24110.0
78 NaN N.S. 25545.0 24863.0
86 mission NaN 41519.0 38554.0
109 peterborough Ont. NaN NaN
131 NaN Que. 29954.0 27359.0
157 timmins NaN 41145.0 41788.0
169 west kelowna B.C. NaN NaN

You can see that some rows are missing multiple values, while others are just missing one. We can remove rows which have missing data using the function dropna and assign the result to df so that we will be working only with complete data going forward. Try to modify the code below to drop rows that contain empty values in one of the two population columns. Put another way, we want to keep rows with population values, even if the name or province values are missing. Look at the documentation to figure this out, specifically the argument subset for .dropna(); one possible version appears after the cell below.

df = df.dropna()
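For reference, a subset-based version might look like this sketch; it drops a row only when one of the population columns is missing:

# Drop rows missing either population value, but keep rows that are
# only missing Name or Prov/terr
df = df.dropna(subset=['Population, 2021', 'Population, 2016'])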

Great. Now let’s reset to our original data frame and exclude any rows with missing values.

df = pd.read_csv("data/cities.csv")
df = df.dropna()

Instead of dropping (i.e. removing) rows with missing values, we can instead replace them with specific values with fillna. For example, df.fillna(0) would replace all missing data values with a 0. This wouldn’t make sense in our data of Canadian cities, but can be useful in other contexts.

Similarly, we can programmatically find and replace data in any other column or across our dataset. For example, if we wanted to rename 'Y.T.' to 'Yukon', we would run the replace function like so: df = df.replace('Y.T.', 'Yukon')


If you are using spreadsheet software such as Excel or Google Sheets, there are different ways of approaching this problem; explore the common ways to clean data and handle missing values in your tool’s documentation.

Creating new columns

We can add or delete columns as needed. Let’s first add a column which shows the change in population between 2016 and 2021 and then sort to see the cities that experienced the greatest population growth. We can calculate this via a simple subtraction as follows.

df["pop_change"] = df["Population, 2021"] - df["Population, 2016"]
df.sort_values("pop_change", ascending = False).head(5)
Name Prov/terr Population, 2021 Population, 2016 pop_change
106 Ottawa Ont. 1017449.0 934243.0 83206.0
42 Edmonton Alta. 1010899.0 933088.0 77811.0
19 Calgary Alta. 1306784.0 1239220.0 67564.0
11 Brampton Ont. 656480.0 593638.0 62842.0
158 Toronto Ont. 2794356.0 2731571.0 62785.0

pandas supports arithmetic between columns, just like the subtraction we did above. Create a new column called pct_pop_change that computes the percentage change in population between 2016 and 2021, and sort to see the cities with the greatest increase.

HINT: the way to compute percent change is 100 * (Y - X) / X.
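One possible solution is sketched below; it also creates the pct_pop_change column that the drop step afterwards expects.

# Percent change from 2016 to 2021: 100 * (new - old) / old
df['pct_pop_change'] = 100 * (df['Population, 2021'] - df['Population, 2016']) / df['Population, 2016']

# Largest percentage increases first
df.sort_values('pct_pop_change', ascending=False).head(5)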

Now let’s clear these new columns out using drop to return to what we had originally.

df = df.drop(columns=['pop_change', 'pct_pop_change'])

In Excel or Google Sheets, you can create a new column by typing the mathematical formula you wish to use in the top-most cell of that column, and then dragging that cell down the length of the column.

Concatenating and joining tables

Much of the time, we are working with multiple sets of data which may overlap in key ways. Perhaps we want to include measures of income in cities, or look at voting patterns - this may require us to combine multiple data frames so we can analyze them.

pandas methods for combining data frames are quite similar to database operations - if you’ve worked with SQL before, this will come very easily to you. There’s an extensive tutorial on this topic, but we’ll focus on simple cases of pd.concat() and pd.merge().

First, we can use pd.concat() to stack DataFrames vertically when new data has the same columns but additional rows (e.g., adding cities from another region). This is like adding new entries to a database table.

df_ny_cities = pd.read_csv("./data/new_york_cities.csv")
combined = pd.concat([df, df_ny_cities], ignore_index=True)
print("Original rows:", len(df), "| After append:", len(combined))
display(combined.tail(5))  # Show new rows
Original rows: 167 | After append: 169
Name Prov/terr Population, 2021 Population, 2016
164 Wood Buffalo Alta. 72326.0 71594.0
165 Woodstock Ont. 46705.0 41098.0
166 Woolwich Ont. 26999.0 25006.0
167 New York City N.Y. 8804190.0 8537673.0
168 Buffalo N.Y. 278349.0 258071.0

Next, we can use pd.merge() (or pd.concat with axis=1) to combine DataFrames side-by-side when they share a key column (e.g., city names). Here, we’ll add a column denoting whether each city is a provincial or territorial capital by matching city names.

Let’s first load this data:

df_capitals = pd.read_csv('./data/capitals.csv')
df_capitals
Name Is_Capital
0 Toronto True
1 Québec True
2 Victoria True
3 Edmonton True
4 Winnipeg True
5 Fredericton True
6 Halifax True
7 Charlottetown True
8 St. John's True
9 Regina True
10 Whitehorse True
df_with_capitals = pd.merge(
    df,  # Original data
    df_capitals,  # New data
    on="Name",  # The same column 
    how="left"  # Keep all data from the "left", ie. original
)

# Set non-capitals to False
df_with_capitals["Is_Capital"] = df_with_capitals["Is_Capital"].astype('boolean').fillna(False)

# Verify: Show capitals and counts
print(f"Found {df_with_capitals['Is_Capital'].sum()} capitals:")
df_with_capitals[df_with_capitals["Is_Capital"]]
Found 11 capitals:
Name Prov/terr Population, 2021 Population, 2016 Is_Capital
23 Charlottetown P.E.I. 38809.0 36094.0 True
38 Edmonton Alta. 1010899.0 933088.0 True
41 Fredericton N.B. 63116.0 58721.0 True
49 Halifax N.S. 439819.0 403131.0 True
108 Québec Que. 549459.0 531902.0 True
111 Regina Sask. 226404.0 215106.0 True
139 St. John's N.L. 110525.0 108860.0 True
147 Toronto Ont. 2794356.0 2731571.0 True
154 Victoria B.C. 91867.0 85792.0 True
161 Whitehorse Y.T. 28201.0 25085.0 True
163 Winnipeg Man. 749607.0 705244.0 True


To do this using spreadsheet software like Excel or Google Sheets, there are a few different options; one of the most common approaches for joining two separate datasets is the =VLOOKUP formula.

Summary statistics

A dataset is only as good as its ability to help us understand and describe the world around us. There are some basic methods in pandas we can use to get an idea of what the data says.

The most basic function you can use to get an idea of what’s going on is .describe(). It shows you how much data there is, and a number of summary statistics.

Modify the code below to report summary statistics for cities in Quebec only.

df.describe()
Population, 2021 Population, 2016
count 1.670000e+02 1.670000e+02
mean 1.573169e+05 1.487358e+05
std 3.074452e+05 2.959960e+05
min 2.570400e+04 2.378700e+04
25% 3.770050e+04 3.449850e+04
50% 6.414100e+04 6.316600e+04
75% 1.348910e+05 1.255940e+05
max 2.794356e+06 2.731571e+06
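One way to do this is to filter first and then describe the result (a minimal sketch):

# Summary statistics restricted to Quebec cities
df.loc[df['Prov/terr'] == 'Que.'].describe()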

Instead of picking out and examining a subset of the data one by one, we can use the function .groupby() to examine several subsets concurrently. Given a column name, it will group rows which have the same value. In the example below, that means grouping every row which has the same province name. We can then apply a function to this (or multiple functions using .agg()) to examine different aspects of the data.

# Group by province and calculate total population
province_pop = df.groupby('Prov/terr')['Population, 2021'].sum()
print("Total population by province:")
province_pop.sort_values(ascending=False)
Total population by province:
Prov/terr
Ont.      11598308.0
Que.       5474109.0
B.C.       3662922.0
Alta.      3192892.0
Man.        800920.0
Sask.       563966.0
N.S.        533513.0
N.B.        240595.0
N.L.        137693.0
P.E.I.       38809.0
Y.T.         28201.0
Name: Population, 2021, dtype: float64
# Multiple aggregation statistics
stats = df.groupby('Prov/terr')['Population, 2021'].agg(['count', 'mean', 'max', 'sum'])
stats
count mean max sum
Prov/terr
Alta. 17 187817.176471 1306784.0 3192892.0
B.C. 29 126307.655172 662248.0 3662922.0
Man. 2 400460.000000 749607.0 800920.0
N.B. 4 60148.750000 79470.0 240595.0
N.L. 2 68846.500000 110525.0 137693.0
N.S. 2 266756.500000 439819.0 533513.0
Ont. 64 181223.562500 2794356.0 11598308.0
P.E.I. 1 38809.000000 38809.0 38809.0
Que. 41 133514.853659 1762949.0 5474109.0
Sask. 4 140991.500000 266141.0 563966.0
Y.T. 1 28201.000000 28201.0 28201.0

Below, we’ve added a column which shows the percent growth for each city. Use .groupby('Prov/terr') and use the function .median() (just as we used .sum() above) to observe the median growth rate per province or territory. Make sure to use sort_values() and set ascending to False.

df['Percent_Growth'] = 100 * (df['Population, 2021'] - df['Population, 2016']) / df['Population, 2016'] 
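One possible solution:

# Median growth rate per province/territory, largest first
df.groupby('Prov/terr')['Percent_Growth'].median().sort_values(ascending=False)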


In spreadsheet software like Excel or Google Sheets, you can calculate descriptive statistics for your dataset by applying individual formulas to the specific columns you wish to analyze, such as =COUNT (count), =MIN (minimum), =MAX (maximum), =AVERAGE (mean), =MEDIAN (median), =PERCENTILE (percentiles, which can be used for quartiles, quantiles, or any percentile you wish), and =STDEVA (standard deviation). If it’s available in your Excel subscription, you can also explore Excel’s Analysis ToolPak add-in, which allows for more efficient and advanced statistical analysis.

Cross tabulations and pivot tables

A cross tabulation, also called a frequency table or a contingency table, is a table that summarizes two categorical variables by displaying the number of occurrences in each pair of categories.

Here’s an example. Let’s count the number of cities in each province by city size.

We will need two tools to do this:
- cut, which bins continuous numbers (like population) into categories (e.g., “Small”/“Medium”/“Large”), turning numbers into meaningful groups.
- crosstab, which counts how often these groups appear together (like how many “Medium” cities exist per province), revealing patterns that raw numbers hide.

# Create a size category column
df['Size'] = pd.cut(df['Population, 2021'],
                    bins=[0, 50000, 200000, float('inf')],
                    labels=['Small', 'Medium', 'Large'])

# Cross-tab: Province vs. Size
size_table = pd.crosstab(df['Prov/terr'], df['Size'], margins=True)
size_table

Recall that we just created the column 'Percent_Growth' as well. Use these two functions to create different bins for different levels of growth and cross tabulate them.
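As a starting point, here is one possible sketch; the bin edges and labels below are arbitrary choices you should adapt:

# Bin percent growth into three rough categories
df['Growth'] = pd.cut(df['Percent_Growth'],
                      bins=[float('-inf'), 0, 5, float('inf')],
                      labels=['Declining', 'Slow', 'Fast'])

# Count cities by province and growth category
pd.crosstab(df['Prov/terr'], df['Growth'], margins=True)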

If you’ve worked with Excel or Google Sheets, this is very similar to building a Pivot Table.

Pivot tables can summarize data across several categories and for different types of summaries (e.g. sum, mean, median, etc.).

The pivot_table function in pandas aggregates data and has more options than the crosstab function, though the two are quite similar. Here’s an example where we use pivot_table to sum the population in each province by the 3 size groups.

Try to edit this to compute the mean or median city Percent_Growth (a sketch follows the output below).

pd.pivot_table(data = df, values = ['Population, 2021'], index = ['Prov/terr'], columns = ['Size'], aggfunc = 'sum', observed=True)
Population, 2021
Size Small Medium Large
Prov/terr
Alta. 234664.0 640545.0 2317683.0
B.C. 330537.0 1642753.0 1689632.0
Man. NaN 51313.0 749607.0
N.B. 28114.0 212481.0 NaN
N.L. 27168.0 110525.0 NaN
N.S. NaN 93694.0 439819.0
Ont. 930812.0 2925641.0 7741855.0
P.E.I. 38809.0 NaN NaN
Que. 806267.0 1371544.0 3296298.0
Sask. 71421.0 NaN 492545.0
Y.T. 28201.0 NaN NaN
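For example, swapping in the growth column and a different aggregation function might look like this sketch:

# Median percent growth per province, broken out by city size
pd.pivot_table(data=df, values='Percent_Growth', index='Prov/terr',
               columns='Size', aggfunc='median', observed=True)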

There are often multiple ways to achieve similar results. In the previous section we looked at the groupby function to summarize data by one column, while above we used crosstab or pivot_table. However, we could also use the groupby function for this purpose. Check out the example below.

You should notice that it has created a long-table format rather than a wide-table format. Both formats can be useful. Wide-table formats are better for viewing data with only 2 categories, while long-table formats are often used for inputting data into modelling or visualization libraries.

df.groupby(['Prov/terr', 'Size'], observed=False)['Population, 2021'].agg(['sum'])
sum
Prov/terr Size
Alta. Small 234664.0
Medium 640545.0
Large 2317683.0
B.C. Small 330537.0
Medium 1642753.0
Large 1689632.0
Man. Small 0.0
Medium 51313.0
Large 749607.0
N.B. Small 28114.0
Medium 212481.0
Large 0.0
N.L. Small 27168.0
Medium 110525.0
Large 0.0
N.S. Small 0.0
Medium 93694.0
Large 439819.0
Ont. Small 930812.0
Medium 2925641.0
Large 7741855.0
P.E.I. Small 38809.0
Medium 0.0
Large 0.0
Que. Small 806267.0
Medium 1371544.0
Large 3296298.0
Sask. Small 71421.0
Medium 0.0
Large 492545.0
Y.T. Small 28201.0
Medium 0.0
Large 0.0