Processing and analyzing data

Aniket Kali, Jeff Allen

This notebook provides an introduction to analyzing urban data. It will cover …

To do this, we’ll primarily be using pandas, a Python library for analyzing and organizing tabular data. For each section, we’ll also link to relevant documentation for doing similar tasks in Microsoft Excel and Google Sheets.

If you haven’t installed pandas yet, you can install it by “un-commenting” the line of code below (i.e., removing the # and following space at the beginning of the line).

# !pip install pandas

Next, let’s import the pandas library using the pd alias. An alias in Python is an alternate name for a library that’s typically shorter and easier to reference in the code later on.

import pandas as pd

pandas is the most common library for working with both big and small datasets in Python, and it is also the basis for working with more analytical packages (e.g. numpy, scipy, scikit-learn) and analyzing geographic data (e.g. geopandas).

Tables & DataFrames

A very common way of structuring and analyzing a dataset is in a 2-dimensional table format, where rows are individual records or observations, while columns are attributes (often called variables) that describe those observations. For example, each row could represent a city and the columns could indicate the population (column 1 = population) for different time periods (column 2 = year). Data stored in spreadsheets often take this format.

In pandas, a DataFrame is a tabular data structure similar to a spreadsheet, where data is organized in rows and columns. Columns can contain different kinds of data, such as numbers, strings, dates, and so on. When we load data in pandas, we typically load it into the structure of a DataFrame.

Let’s first take a look at a small dataset, which includes Census data about Canadian municipalities and their populations in 2021 and 2016. In Statistics Canada lingo, these are called Census Subdivisions. This dataset only includes municipalities with a population greater than 25,000 in 2021.

The main method for loading CSV data is the read_csv function, but pandas can also read and write (export) many other data formats.

df = pd.read_csv("data/cities.csv")

Great! Now our data is stored in the variable df and has the structure of a DataFrame.

Viewing data

In spreadsheet software, when we open a data file, we will see the top rows of the data right away.

In pandas, we can simply type the name of the DataFrame, in this case df, in a cell to view it. By default, it will print the top and bottom rows of the DataFrame.

df
Name Prov/terr Population, 2021 Population, 2016
0 Abbotsford B.C. 153524.0 141397.0
1 Airdrie Alta. 74100.0 61581.0
2 Ajax Ont. 126666.0 119677.0
3 Alma Que. 30331.0 30771.0
4 Aurora Ont. 62057.0 55445.0
... ... ... ... ...
174 Windsor Ont. 229660.0 217188.0
175 Winnipeg Man. 749607.0 705244.0
176 Wood Buffalo Alta. 72326.0 71594.0
177 Woodstock Ont. 46705.0 41098.0
178 Woolwich Ont. 26999.0 25006.0

179 rows × 4 columns

We can specify which rows we want to view.

Let’s explore what this data frame looks like. Calling the method .head(N) or .tail(N) prints the top or bottom N rows of the DataFrame. The following prints the first 10 rows.

Try to edit this to print the bottom 10 rows or a different number of rows.

df.head(10)
Name Prov/terr Population, 2021 Population, 2016
0 Abbotsford B.C. 153524.0 141397.0
1 Airdrie Alta. 74100.0 61581.0
2 Ajax Ont. 126666.0 119677.0
3 Alma Que. 30331.0 30771.0
4 Aurora Ont. 62057.0 55445.0
5 Barrie Ont. 147829.0 141434.0
6 Belleville Ont. 55071.0 50716.0
7 Blainville Que. 59819.0 NaN
8 Boisbriand Que. 28308.0 26884.0
9 Boucherville Que. 41743.0 41671.0

Notice that each column has a unique name. We can view the data of this column alone by using that name, and see what unique values exist using .unique().

Try viewing the data of another column. Beware of upper and lower case – exact names matter!

df['Prov/terr'].head(10)  # Top 10 only
0     B.C.
1    Alta.
2     Ont.
3     Que.
4     Ont.
5     Ont.
6     Ont.
7     Que.
8     Que.
9     Que.
Name: Prov/terr, dtype: object
df['Prov/terr'].unique() # Unique values for the *full* dataset
array(['B.C.', 'Alta.', 'Ont.', 'Que.', 'Man.', nan, 'N.S.', 'P.E.I.',
       'N.L.', 'N.B.', 'Sask.', 'Y.T.'], dtype=object)

What happens if you do df['Prov/terr'].head(10).unique() instead?

Filtering data

We often want to look at only a portion of our data that fit some set of conditions (e.g. all cities in a province that have a population more than 100,000). This is often called filtering, querying, or subsetting a dataset.

Let’s try to do some filtering in pandas with our data of Canadian cities.

Filtering and sorting can also be done in spreadsheet software; both Excel and Google Sheets have built-in filter and sort tools.

We can use the columns to identify data that we might want to filter by. The line below shows data only for Ontario, but see if you can filter for another province or territory.

df.loc[df['Prov/terr'] == 'Ont.']
Name Prov/terr Population, 2021 Population, 2016
2 Ajax Ont. 126666.0 119677.0
4 Aurora Ont. 62057.0 55445.0
5 Barrie Ont. 147829.0 141434.0
6 Belleville Ont. 55071.0 50716.0
10 Bradford West Gwillimbury Ont. 42880.0 35325.0
... ... ... ... ...
171 Whitby Ont. 138501.0 128377.0
172 Whitchurch-Stouffville Ont. 49864.0 45837.0
174 Windsor Ont. 229660.0 217188.0
177 Woodstock Ont. 46705.0 41098.0
178 Woolwich Ont. 26999.0 25006.0

67 rows × 4 columns

pandas also lets us filter with other comparison operators, not just equality. Previously, we asked for all rows in Ontario.

Now, filter for all cities which had a population of at least 100,000 in 2021.

HINT: in Python, “greater than or equals to” (i.e., “at least”) is represented using the syntax >=.
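For example, a numeric comparison filter follows the same pattern as the equality filter above. Here is a minimal sketch, with a different column, operator, and threshold than the exercise, so you can adapt it:

# Cities with fewer than 30,000 residents in 2016
df.loc[df['Population, 2016'] < 30000]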

Pandas also allows us to combine filtering conditions.

Use the template below to select all cities in Ontario with a population of over 100,000 in 2021. A worked example with a different pair of conditions follows the template.

df.loc[(df["Prov/terr"] == "Ont.") & (YOUR CONDITION HERE)]
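To illustrate the syntax (using a different condition than the exercise), each condition sits in its own parentheses and is joined with & for “and”; | works the same way for “or”:

# Cities in Quebec with a 2016 population below 30,000
df.loc[(df['Prov/terr'] == 'Que.') & (df['Population, 2016'] < 30000)]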

Now let’s count how many cities actually meet these conditions. Run the line below to see how many cities there are in this data set in Ontario.

df.loc[df['Prov/terr'] == 'Ont.'].count()
Name                66
Prov/terr           67
Population, 2021    66
Population, 2016    65
dtype: int64

The function .count() returns the number of non-missing values in each column - but if we wanted to count just one column, we could first select that individual column using df[COL_NAME].

Try a different condition and count the amount of data for it.
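For instance, a single-column count might look like this (a sketch; any condition works):

# Count the non-missing names among British Columbia cities only
df.loc[df['Prov/terr'] == 'B.C.', 'Name'].count()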

Sorting data

You might have noticed that these cities are in alphabetical order - what if we wanted to see them in the order of population? In pandas, we do this using the sort_values function. The default is to sort in ascending order, so we set this to be False (i.e. descending) so the most populous cities are at the top.

Check out these links for sorting data in Excel and Google Sheets:
- Sorting in Excel
- Sorting in Google Sheets

df.sort_values(by='Population, 2021', ascending=False)
Name Prov/terr Population, 2021 Population, 2016
158 Toronto Ont. 2794356.0 2731571.0
89 Montréal Que. 1762949.0 1704694.0
19 Calgary Alta. 1306784.0 1239220.0
106 Ottawa Ont. 1017449.0 934243.0
42 Edmonton Alta. 1010899.0 933088.0
... ... ... ... ...
115 Prince Edward County Ont. 25704.0 24735.0
78 NaN N.S. 25545.0 24863.0
40 Drummondville Que. NaN 75423.0
109 Peterborough Ont. NaN NaN
169 West Kelowna B.C. NaN NaN

179 rows × 4 columns

Let’s put some of this together now.

Filter the data to show all cities which are in the province of Quebec with at least a population of 50,000 in 2016, and then try to sort the cities by their 2016 population.

HINT: You can do this in two steps (which is more readable) by storing the data that you filter into a variable called df_filtered, then running the command to sort the values on df_filtered.
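One possible two-step solution is sketched below; note that it creates the df_filtered variable used in the export example in the next section.

# Step 1: filter for Quebec cities with at least 50,000 residents in 2016
df_filtered = df.loc[(df['Prov/terr'] == 'Que.') & (df['Population, 2016'] >= 50000)]

# Step 2: sort the filtered rows by 2016 population, largest first
df_filtered = df_filtered.sort_values(by='Population, 2016', ascending=False)
df_filtered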

Exporting data

Once we have processed and analyzed our data, we may want to save it for future use or share it with others. pandas makes it easy to export data to various formats, such as CSV or Excel files.

Below, we demonstrate how to export a DataFrame to both CSV and Excel formats. This is particularly useful for sharing results or viewing the data in other tools like spreadsheet software. (Note that writing .xlsx files requires an engine such as openpyxl to be installed.)

# Save to CSV
df_filtered.to_csv('df_filtered.csv', index=False)

# Save to Excel
df_filtered.to_excel('df_filtered.xlsx', index=False)

Updating and renaming columns

Often, the data we have might not be in the condition we want it to be in. Some data might be missing, and other data might have odd naming conventions. In Excel or Google Sheets, you can simply change a column name by typing the edits you want in the appropriate cell. However, Python requires different steps.

Here’s a simple example. We might want all city names to be in lowercase - the code below does this for us.

df['Name'] = df['Name'].str.lower()
df
Name Prov/terr Population, 2021 Population, 2016
0 abbotsford B.C. 153524.0 141397.0
1 airdrie Alta. 74100.0 61581.0
2 ajax Ont. 126666.0 119677.0
3 alma Que. 30331.0 30771.0
4 aurora Ont. 62057.0 55445.0
... ... ... ... ...
174 windsor Ont. 229660.0 217188.0
175 winnipeg Man. 749607.0 705244.0
176 wood buffalo Alta. 72326.0 71594.0
177 woodstock Ont. 46705.0 41098.0
178 woolwich Ont. 26999.0 25006.0

179 rows × 4 columns

DataFrames are easily modifiable. pandas has a number of methods like str.lower() to do so (see the full documentation). The important thing to note here is that we directly modified the existing values of a column. We might not always want to do this, but it is often a good way of saving computer memory by not duplicating data.

Likewise, we might want better names for the columns we have. Take a look at the documentation for .rename() and modify the data frame df such that we rename the column Name to City. Pandas has more methods than we could ever remember, so learning to navigate the documentation is a crucial part of using the library.
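As a pointer, .rename() takes a dictionary mapping old names to new ones. The sketch below only previews the result rather than overwriting df, since later cells in this notebook still expect a column called Name:

# Preview the rename; assign the result back (df = df.rename(...)) to keep it
df.rename(columns={'Name': 'City'}).head(3)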

Handling missing data

Unfortunately, it is pretty common that the datasets we work with will be missing data values for some rows and columns. This can create complications when we want to produce summary statistics or visualizations. There are different strategies for dealing with this (i.e., imputing data), but the easiest is simply to remove missing values. But first let’s check how much data is missing.

df.isnull().sum()
Name                3
Prov/terr           4
Population, 2021    3
Population, 2016    4
dtype: int64

It seems that each column has a couple of data points missing. Let’s take a look at which rows these occur in. Similar to how we created a condition to filter for certain data, the code below creates a condition to filter for rows with missing data.

df.loc[df.isnull().any(axis=1)]
Name Prov/terr Population, 2021 Population, 2016
7 blainville Que. 59819.0 NaN
18 caledon NaN 76581.0 66502.0
30 NaN Ont. 101427.0 92013.0
40 drummondville Que. NaN 75423.0
51 grimsby Ont. 28883.0 NaN
64 la prairie NaN 26406.0 24110.0
78 NaN N.S. 25545.0 24863.0
86 mission NaN 41519.0 38554.0
109 peterborough Ont. NaN NaN
131 NaN Que. 29954.0 27359.0
157 timmins NaN 41145.0 41788.0
169 west kelowna B.C. NaN NaN

You can see that some rows are missing multiple values, while others are just missing one. We can remove rows which have missing data using the function dropna and assign the result to df so that we will be working only with complete data going forward. Try to modify the code below to drop rows that contain empty values in one of the two population columns. Put another way, we want to keep rows with population values, even if the name or province values are missing. Look at the documentation to figure this out, specifically the argument subset for .dropna(); one possible version appears after the cell below.

df = df.dropna()
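For reference, a subset-based version might look like this sketch; it drops a row only when one of the population columns is missing:

# Drop rows missing either population value, but keep rows that are
# only missing Name or Prov/terr
df = df.dropna(subset=['Population, 2021', 'Population, 2016'])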

Great. Now let’s reset to our original data frame and exclude any rows with missing values.

df = pd.read_csv("data/cities.csv")
df = df.dropna()

Instead of dropping (i.e. removing) rows with missing values, we can instead replace them with specific values with fillna. For example, df.fillna(0) would replace all missing data values with a 0. This wouldn’t make sense in our data of Canadian cities, but can be useful in other contexts.

Similarly, we can programmatically find and replace data in any other column or across our dataset. For example, if we wanted to rename 'Y.T.' to 'Yukon', we would run the replace function like so: df = df.replace('Y.T.', 'Yukon')


If you are using spreadsheet software such as Excel or Google Sheets, there are different ways of approaching this problem; explore the common ways to clean data and handle missing values in your tool’s documentation.

Creating new columns

We can add or delete columns as needed. Let’s first add a column which shows the change in population between 2016 and 2021 and then sort to see the cities that experienced the greatest population growth. We can calculate this via a simple subtraction as follows.

df["pop_change"] = df["Population, 2021"] - df["Population, 2016"]
df.sort_values("pop_change", ascending = False).head(5)
Name Prov/terr Population, 2021 Population, 2016 pop_change
106 Ottawa Ont. 1017449.0 934243.0 83206.0
42 Edmonton Alta. 1010899.0 933088.0 77811.0
19 Calgary Alta. 1306784.0 1239220.0 67564.0
11 Brampton Ont. 656480.0 593638.0 62842.0
158 Toronto Ont. 2794356.0 2731571.0 62785.0

pandas supports arithmetic between columns, just like the subtraction we did above. Create a new column called pct_pop_change that computes the percentage change in population between 2016 and 2021, and sort to see the cities with the greatest increase.

HINT: the way to compute percent change is 100 * (Y - X) / X.
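One possible solution is sketched below; it also creates the pct_pop_change column that the drop step afterwards expects.

# Percent change from 2016 to 2021: 100 * (new - old) / old
df['pct_pop_change'] = 100 * (df['Population, 2021'] - df['Population, 2016']) / df['Population, 2016']

# Largest percentage increases first
df.sort_values('pct_pop_change', ascending=False).head(5)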

Now let’s clear these new columns out using drop to return to what we had originally.

df = df.drop(columns=['pop_change', 'pct_pop_change'])

In Excel or Google Sheets, you can create a new column by typing the mathematical formula you wish to use in the top-most cell of that column, and then dragging that cell down the length of the column.

Concatenating and joining tables

Much of the time, we are working with multiple sets of data which may overlap in key ways. Perhaps we want to include measures of income in cities, or look at voting patterns - this may require us to combine multiple data frames so we can analyze them.

pandas methods for combining data frames are quite similar to database operations - if you’ve worked with SQL before, this will come very easily to you. There’s an extensive tutorial on this topic, but we’ll focus on simple cases of pd.concat() and pd.merge().

First, we can use pd.concat() to stack DataFrames vertically when new data has the same columns but additional rows (e.g., adding cities from another region). This is like adding new entries to a database table.

df_ny_cities = pd.read_csv("./data/new_york_cities.csv")
combined = pd.concat([df, df_ny_cities], ignore_index=True)
print("Original rows:", len(df), "| After append:", len(combined))
display(combined.tail(5))  # Show new rows
Original rows: 167 | After append: 169
Name Prov/terr Population, 2021 Population, 2016
164 Wood Buffalo Alta. 72326.0 71594.0
165 Woodstock Ont. 46705.0 41098.0
166 Woolwich Ont. 26999.0 25006.0
167 New York City N.Y. 8804190.0 8537673.0
168 Buffalo N.Y. 278349.0 258071.0

Next, we can use pd.merge() (or pd.concat with axis=1) to combine DataFrames side-by-side when they share a key column (e.g., city names). Here, we’ll add a column denoting whether each city is a provincial or territorial capital by matching city names.

Let’s first load this data:

df_capitals = pd.read_csv('./data/capitals.csv')
df_capitals
Name Is_Capital
0 Toronto True
1 Québec True
2 Victoria True
3 Edmonton True
4 Winnipeg True
5 Fredericton True
6 Halifax True
7 Charlottetown True
8 St. John's True
9 Regina True
10 Whitehorse True
df_with_capitals = pd.merge(
    df,  # Original data
    df_capitals,  # New data
    on="Name",  # The same column 
    how="left"  # Keep all data from the "left", ie. original
)

# Set non-capitals to False
df_with_capitals["Is_Capital"] = df_with_capitals["Is_Capital"].astype('boolean').fillna(False)

# Verify: Show capitals and counts
print(f"Found {df_with_capitals['Is_Capital'].sum()} capitals:")
df_with_capitals[df_with_capitals["Is_Capital"]]
Found 11 capitals:
Name Prov/terr Population, 2021 Population, 2016 Is_Capital
23 Charlottetown P.E.I. 38809.0 36094.0 True
38 Edmonton Alta. 1010899.0 933088.0 True
41 Fredericton N.B. 63116.0 58721.0 True
49 Halifax N.S. 439819.0 403131.0 True
108 Québec Que. 549459.0 531902.0 True
111 Regina Sask. 226404.0 215106.0 True
139 St. John's N.L. 110525.0 108860.0 True
147 Toronto Ont. 2794356.0 2731571.0 True
154 Victoria B.C. 91867.0 85792.0 True
161 Whitehorse Y.T. 28201.0 25085.0 True
163 Winnipeg Man. 749607.0 705244.0 True


To do this using spreadsheet software like Excel or Google Sheets, there are a few different options; one of the most common approaches for joining two separate datasets is the =VLOOKUP formula.

Summary statistics

A dataset is only as good as its ability to help us understand and describe the world around us. There are some basic methods in pandas we can use to get an idea of what the data says.

The most basic function you can use to get an idea of what’s going on is .describe(). It shows you how much data there is, and a number of summary statistics.

Modify the code below to report summary statistics for cities in Quebec only.

df.describe()
Population, 2021 Population, 2016
count 1.670000e+02 1.670000e+02
mean 1.573169e+05 1.487358e+05
std 3.074452e+05 2.959960e+05
min 2.570400e+04 2.378700e+04
25% 3.770050e+04 3.449850e+04
50% 6.414100e+04 6.316600e+04
75% 1.348910e+05 1.255940e+05
max 2.794356e+06 2.731571e+06
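One way to do this is to filter first and then describe the result (a minimal sketch):

# Summary statistics restricted to Quebec cities
df.loc[df['Prov/terr'] == 'Que.'].describe()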

Instead of picking out and examining a subset of the data one by one, we can use the function .groupby() to examine several subsets concurrently. Given a column name, it will group rows which have the same value. In the example below, that means grouping every row which has the same province name. We can then apply a function to this (or multiple functions using .agg()) to examine different aspects of the data.

# Group by province and calculate total population
province_pop = df.groupby('Prov/terr')['Population, 2021'].sum()
print("Total population by province:")
province_pop.sort_values(ascending=False)
Total population by province:
Prov/terr
Ont.      11598308.0
Que.       5474109.0
B.C.       3662922.0
Alta.      3192892.0
Man.        800920.0
Sask.       563966.0
N.S.        533513.0
N.B.        240595.0
N.L.        137693.0
P.E.I.       38809.0
Y.T.         28201.0
Name: Population, 2021, dtype: float64
# Multiple aggregation statistics
stats = df.groupby('Prov/terr')['Population, 2021'].agg(['count', 'mean', 'max', 'sum'])
stats
count mean max sum
Prov/terr
Alta. 17 187817.176471 1306784.0 3192892.0
B.C. 29 126307.655172 662248.0 3662922.0
Man. 2 400460.000000 749607.0 800920.0
N.B. 4 60148.750000 79470.0 240595.0
N.L. 2 68846.500000 110525.0 137693.0
N.S. 2 266756.500000 439819.0 533513.0
Ont. 64 181223.562500 2794356.0 11598308.0
P.E.I. 1 38809.000000 38809.0 38809.0
Que. 41 133514.853659 1762949.0 5474109.0
Sask. 4 140991.500000 266141.0 563966.0
Y.T. 1 28201.000000 28201.0 28201.0

Below, we’ve added a column which shows the percent growth for each city. Use .groupby('Prov/terr') and use the function .median() (just as we used .sum() above) to observe the median growth rate per province or territory. Make sure to use sort_values() and set ascending to False.

df['Percent_Growth'] = 100 * (df['Population, 2021'] - df['Population, 2016']) / df['Population, 2016'] 
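One possible solution:

# Median growth rate per province/territory, largest first
df.groupby('Prov/terr')['Percent_Growth'].median().sort_values(ascending=False)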


In spreadsheet software like Excel or Google Sheets, you can calculate descriptive statistics for your dataset by applying individual formulas to the specific columns you wish to analyze, such as =COUNT (count), =MIN (minimum), =MAX (maximum), =AVERAGE (mean), =MEDIAN (median), =PERCENTILE (percentiles, which can be used for quartiles, quantiles, or any percentile you wish), and =STDEVA (standard deviation). If it’s available in your Excel subscription, you can also explore Excel’s Analysis ToolPak add-in, which allows for more efficient and advanced statistical analysis.

Cross tabulations and pivot tables

A cross tabulation, also called a frequency table or a contingency table, is a table that summarizes two categorical variables by displaying the number of occurrences in each pair of categories.

Here’s an example. Let’s count the number of cities in each province by city size.

We will need two tools to do this:
- cut, which bins continuous numbers (like population) into categories (e.g., “Small”/“Medium”/“Large”), turning numbers into meaningful groups.
- crosstab, which counts how often these groups appear together (like how many “Medium” cities exist per province), revealing patterns that raw numbers hide.

# Create a size category column
df['Size'] = pd.cut(df['Population, 2021'],
                    bins=[0, 50000, 200000, float('inf')],
                    labels=['Small', 'Medium', 'Large'])

# Cross-tab: Province vs. Size
size_table = pd.crosstab(df['Prov/terr'], df['Size'], margins=True)
size_table

Recall that we just created the column 'Percent_Growth' as well. Use these two functions to create different bins for different levels of growth and cross tabulate them.
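As a starting point, here is one possible sketch; the bin edges and labels below are arbitrary choices you should adapt:

# Bin percent growth into three rough categories
df['Growth'] = pd.cut(df['Percent_Growth'],
                      bins=[float('-inf'), 0, 5, float('inf')],
                      labels=['Declining', 'Slow', 'Fast'])

# Count cities by province and growth category
pd.crosstab(df['Prov/terr'], df['Growth'], margins=True)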

If you’ve worked with Excel or Google Sheets, this is very similar to building a Pivot Table.

Pivot tables can summarize data across several categories and for different types of summaries (e.g. sum, mean, median, etc.).

The pivot_table function in pandas aggregates data and has more options than the crosstab function, though the two are quite similar. Here’s an example where we use pivot_table to sum the population in each province by the 3 size groups.

Try to edit this to compute the mean or median city Percent_Growth (a sketch follows the output below).

pd.pivot_table(data = df, values = ['Population, 2021'], index = ['Prov/terr'], columns = ['Size'], aggfunc = 'sum', observed=True)
Population, 2021
Size Small Medium Large
Prov/terr
Alta. 234664.0 640545.0 2317683.0
B.C. 330537.0 1642753.0 1689632.0
Man. NaN 51313.0 749607.0
N.B. 28114.0 212481.0 NaN
N.L. 27168.0 110525.0 NaN
N.S. NaN 93694.0 439819.0
Ont. 930812.0 2925641.0 7741855.0
P.E.I. 38809.0 NaN NaN
Que. 806267.0 1371544.0 3296298.0
Sask. 71421.0 NaN 492545.0
Y.T. 28201.0 NaN NaN
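For example, swapping in the growth column and a different aggregation function might look like this sketch:

# Median percent growth per province, broken out by city size
pd.pivot_table(data=df, values='Percent_Growth', index='Prov/terr',
               columns='Size', aggfunc='median', observed=True)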

There are often multiple ways to achieve similar results. In the previous section we looked at the groupby function to summarize data by one column, while above we used crosstab or pivot_table. However, we could also use the groupby function for this purpose. Check out the example below.

You should notice that it has created a long-table format rather than a wide-table format. Both formats can be useful. Wide-table formats are better for viewing data with only 2 categories, while long-table formats are often used for inputting data into modelling or visualization libraries.

df.groupby(['Prov/terr', 'Size'], observed=False)['Population, 2021'].agg(['sum'])
sum
Prov/terr Size
Alta. Small 234664.0
Medium 640545.0
Large 2317683.0
B.C. Small 330537.0
Medium 1642753.0
Large 1689632.0
Man. Small 0.0
Medium 51313.0
Large 749607.0
N.B. Small 28114.0
Medium 212481.0
Large 0.0
N.L. Small 27168.0
Medium 110525.0
Large 0.0
N.S. Small 0.0
Medium 93694.0
Large 439819.0
Ont. Small 930812.0
Medium 2925641.0
Large 7741855.0
P.E.I. Small 38809.0
Medium 0.0
Large 0.0
Que. Small 806267.0
Medium 1371544.0
Large 3296298.0
Sask. Small 71421.0
Medium 0.0
Large 492545.0
Y.T. Small 28201.0
Medium 0.0
Large 0.0