Processing and analyzing data
This notebook provides an introduction to analyzing urban data. It will cover …
- Exploring, filtering, and sorting data
- Cleaning data and removing missing data
- Creating new columns from existing data
- Joining data from multiple tables
- Computing descriptive statistics (sum, mean, etc.)
- Aggregating data via cross-tabulations and pivot tables
To do this, we’ll primarily be using pandas, a Python library for analyzing and organizing tabular data. For each section, we’ll also link to relevant documentation for doing similar tasks in Microsoft Excel and Google Sheets.
If you haven’t installed pandas, you’ll have to install it by “un-commenting” the line of code below (i.e., removing the # and the following space at the beginning of the line).
# !pip install pandas
Next, let’s import the pandas library using the pd alias. An alias in Python is an alternate name for a library that’s typically shorter and easier to reference in the code later on.
import pandas as pd
pandas is the most common library for working with both big and small datasets in Python, and it is also the basis for more analytical packages (e.g. numpy, scipy, scikit-learn) and for analyzing geographic data (e.g. geopandas).
Tables & DataFrames
A very common way of structuring and analyzing a dataset is in a 2-dimensional table format, where rows are individual records or observations, while columns are attributes (often called variables) that describe those observations. For example, each row could represent a city and the columns could indicate the population (column 1 = population) for different time periods (column 2 = year). Data stored in spreadsheets often take this format.
In pandas, a DataFrame is a tabular data structure similar to a spreadsheet, where data is organized in rows and columns. Columns can contain different kinds of data, such as numbers, strings, dates, and so on. When we load data in pandas, we typically load it into the structure of a DataFrame.
Let’s first take a look at a small dataset, which includes Census data about Canadian municipalities and their populations in 2021 and 2016. In Statistics Canada lingo, these are called Census Subdivisions. This dataset only includes municipalities with a population greater than 25,000 in 2021.
The main method for loading CSV data is the read_csv function, but pandas can also read and write (export) many other data formats.
df = pd.read_csv("data/cities.csv")
Great! Now our data is stored in the variable df and has the structure of a DataFrame.
Viewing data
In spreadsheet software, when we open a data file, we will see the top rows of the data right away.
In pandas, we can simply type the name of the DataFrame, in this case df, in the cell to view it. By default, it will print the top and bottom rows in the DataFrame.
df
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
0 | Abbotsford | B.C. | 153524.0 | 141397.0 |
1 | Airdrie | Alta. | 74100.0 | 61581.0 |
2 | Ajax | Ont. | 126666.0 | 119677.0 |
3 | Alma | Que. | 30331.0 | 30771.0 |
4 | Aurora | Ont. | 62057.0 | 55445.0 |
... | ... | ... | ... | ... |
174 | Windsor | Ont. | 229660.0 | 217188.0 |
175 | Winnipeg | Man. | 749607.0 | 705244.0 |
176 | Wood Buffalo | Alta. | 72326.0 | 71594.0 |
177 | Woodstock | Ont. | 46705.0 | 41098.0 |
178 | Woolwich | Ont. | 26999.0 | 25006.0 |
179 rows × 4 columns
We can specify which rows we want to view.
Let’s explore what this data frame looks like. Adding the function .head(N) or .tail(N) prints the top or bottom N rows of the DataFrame. The following prints the first 10 rows.
Try to edit this to print the bottom 10 rows or a different number of rows.
df.head(10)
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
0 | Abbotsford | B.C. | 153524.0 | 141397.0 |
1 | Airdrie | Alta. | 74100.0 | 61581.0 |
2 | Ajax | Ont. | 126666.0 | 119677.0 |
3 | Alma | Que. | 30331.0 | 30771.0 |
4 | Aurora | Ont. | 62057.0 | 55445.0 |
5 | Barrie | Ont. | 147829.0 | 141434.0 |
6 | Belleville | Ont. | 55071.0 | 50716.0 |
7 | Blainville | Que. | 59819.0 | NaN |
8 | Boisbriand | Que. | 28308.0 | 26884.0 |
9 | Boucherville | Que. | 41743.0 | 41671.0 |
Notice that each column has a unique name. We can view the data of this column alone by using that name, and see what unique values exist using .unique().
Try viewing the data of another column. Beware of upper and lower case – exact names matter!
df['Prov/terr'].head(10) # Top 10 only
0 B.C.
1 Alta.
2 Ont.
3 Que.
4 Ont.
5 Ont.
6 Ont.
7 Que.
8 Que.
9 Que.
Name: Prov/terr, dtype: object
df['Prov/terr'].unique() # Unique values for the *full* dataset
array(['B.C.', 'Alta.', 'Ont.', 'Que.', 'Man.', nan, 'N.S.', 'P.E.I.',
'N.L.', 'N.B.', 'Sask.', 'Y.T.'], dtype=object)
What happens if you do df['Prov/terr'].head(10).unique() instead?
Filtering data
We often want to look at only a portion of our data that fit some set of conditions (e.g. all cities in a province that have a population more than 100,000). This is often called filtering, querying, or subsetting a dataset.
Let’s try to do some filtering in pandas with our data of Canadian cities.
Check out these links for filtering and sorting in spreadsheet software:
We can use the columns to identify data that we might want to filter by. The line below shows data only for Ontario, but see if you can filter for another province or territory.
df.loc[df['Prov/terr'] == 'Ont.']
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
2 | Ajax | Ont. | 126666.0 | 119677.0 |
4 | Aurora | Ont. | 62057.0 | 55445.0 |
5 | Barrie | Ont. | 147829.0 | 141434.0 |
6 | Belleville | Ont. | 55071.0 | 50716.0 |
10 | Bradford West Gwillimbury | Ont. | 42880.0 | 35325.0 |
... | ... | ... | ... | ... |
171 | Whitby | Ont. | 138501.0 | 128377.0 |
172 | Whitchurch-Stouffville | Ont. | 49864.0 | 45837.0 |
174 | Windsor | Ont. | 229660.0 | 217188.0 |
177 | Woodstock | Ont. | 46705.0 | 41098.0 |
178 | Woolwich | Ont. | 26999.0 | 25006.0 |
67 rows × 4 columns
Pandas allows us to use other similar mathematical comparisons to filter data. Previously, we asked for all data in Ontario.
Now, filter for all cities which had a population of at least 100,000 in 2021.
HINT: in Python, “greater than or equal to” (i.e., “at least”) is represented using the syntax >=.
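If you get stuck, here is one possible way to write this filter (using the same column names as above):
df.loc[df['Population, 2021'] >= 100000] # Cities with at least 100,000 people in 2021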
Pandas also allows us to combine filtering conditions.
Use the template below to select all cities in Ontario with a population of over 100,000 in 2021.
"Prov/terr"] == "Ont.") & (YOUR CONDITION HERE)] df.loc[(df[
Now let’s count how many cities actually meet these conditions. Run the line below to see how many cities there are in this data set in Ontario.
df.loc[df['Prov/terr'] == 'Ont.'].count()
Name 66
Prov/terr 67
Population, 2021 66
Population, 2016 65
dtype: int64
The function .count() tells us how much data there is for each column - but if we wanted to just see one column, we could also filter for that individual column using df[COL_NAME].
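For example, here is a small sketch that counts just the Name column for cities in Ontario:
df.loc[df['Prov/terr'] == 'Ont.', 'Name'].count() # Count one column only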
Try a different condition and count the amount of data for it.
Sorting data
You might have noticed that these cities are in alphabetical order - what if we wanted to see them in the order of population? In pandas, we do this using the sort_values function. The default is to sort in ascending order, so we set this to False (i.e. descending) so the most populous cities are at the top.
Check out these links for sorting data in Excel and Google Sheets:
- Sorting in Excel
- Sorting in Google Sheets
df.sort_values(by='Population, 2021', ascending=False)
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
158 | Toronto | Ont. | 2794356.0 | 2731571.0 |
89 | Montréal | Que. | 1762949.0 | 1704694.0 |
19 | Calgary | Alta. | 1306784.0 | 1239220.0 |
106 | Ottawa | Ont. | 1017449.0 | 934243.0 |
42 | Edmonton | Alta. | 1010899.0 | 933088.0 |
... | ... | ... | ... | ... |
115 | Prince Edward County | Ont. | 25704.0 | 24735.0 |
78 | NaN | N.S. | 25545.0 | 24863.0 |
40 | Drummondville | Que. | NaN | 75423.0 |
109 | Peterborough | Ont. | NaN | NaN |
169 | West Kelowna | B.C. | NaN | NaN |
179 rows × 4 columns
Let’s put some of this together now.
Filter the data to show all cities in the province of Quebec with a population of at least 50,000 in 2016, and then try to sort the cities by their 2016 population.
HINT: You can do this in two steps (which is more readable) by storing the data that you filter into a variable called df_filtered, then running the command to sort the values on df_filtered.
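If you want to check your work, here is one possible solution (the export example in the next section assumes a df_filtered variable like this exists):
# Step 1: filter for Quebec cities with at least 50,000 people in 2016
df_filtered = df.loc[(df['Prov/terr'] == 'Que.') & (df['Population, 2016'] >= 50000)]

# Step 2: sort the filtered cities by their 2016 population
df_filtered.sort_values(by='Population, 2016', ascending=False)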
Exporting data
Once we have processed and analyzed our data, we may want to save it for future use or share it with others. pandas makes it easy to export data to various formats, such as CSV or Excel files.
Below, we demonstrate how to export a DataFrame to both CSV and Excel formats. This is particularly useful for sharing results or viewing the data in other tools like spreadsheet software.
# Save to CSV
df_filtered.to_csv('df_filtered.csv', index=False)

# Save to Excel
df_filtered.to_excel('df_filtered.xlsx', index=False)
Updating and renaming columns
Often, the data we have might not be in the condition we want it to be in. Some data might be missing, and other data might have odd naming conventions. In Excel or Google Sheets, you can simply change a column name by typing the edits you want in the appropriate cell. However, Python requires different steps.
Here’s a simple example. We might want all city names to be in lowercase - the code below does this for us.
df['Name'] = df['Name'].str.lower()
df
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
0 | abbotsford | B.C. | 153524.0 | 141397.0 |
1 | airdrie | Alta. | 74100.0 | 61581.0 |
2 | ajax | Ont. | 126666.0 | 119677.0 |
3 | alma | Que. | 30331.0 | 30771.0 |
4 | aurora | Ont. | 62057.0 | 55445.0 |
... | ... | ... | ... | ... |
174 | windsor | Ont. | 229660.0 | 217188.0 |
175 | winnipeg | Man. | 749607.0 | 705244.0 |
176 | wood buffalo | Alta. | 72326.0 | 71594.0 |
177 | woodstock | Ont. | 46705.0 | 41098.0 |
178 | woolwich | Ont. | 26999.0 | 25006.0 |
179 rows × 4 columns
DataFrames are easily modifiable. pandas has a number of methods like str.lower() to do so (see the full documentation). The important thing to note here is that we directly modified the existing values of a column. We might not always want to do this, but it is often a good way of saving computer memory by not duplicating data.
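For example, .str.title() is a related string method that capitalizes each word, roughly undoing the lowercasing we just applied:
df['Name'].str.title() # e.g. 'wood buffalo' becomes 'Wood Buffalo'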
Likewise, we might want better names for the columns we have. Take a look at the documentation for .rename() and modify the data frame df such that we rename the column Name to City. Pandas has more methods than we could ever remember, so learning to navigate the documentation is a crucial part of using the library.
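If you want to check your approach, one possible solution looks like this (we reload the original data a little further below, so this rename will not persist):
df = df.rename(columns={'Name': 'City'}) # Rename the column 'Name' to 'City'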
Handling missing data
Unfortunately, it is pretty common that the datasets we work with will be missing data values for some rows and columns. This can create complications when we want to produce summary statistics or visualizations. There are different strategies for dealing with this (e.g., imputing data), but the easiest is simply to remove missing values. But first let’s check how much data is missing.
df.isnull().sum()
Name 3
Prov/terr 4
Population, 2021 3
Population, 2016 4
dtype: int64
It seems that each column has a couple of data points missing. Let’s take a look at which rows these occur in. Similar to how we created a condition to filter for certain data, the code below creates a condition to filter for rows with missing data.
df.loc[df.isnull().any(axis=1)]
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
7 | blainville | Que. | 59819.0 | NaN |
18 | caledon | NaN | 76581.0 | 66502.0 |
30 | NaN | Ont. | 101427.0 | 92013.0 |
40 | drummondville | Que. | NaN | 75423.0 |
51 | grimsby | Ont. | 28883.0 | NaN |
64 | la prairie | NaN | 26406.0 | 24110.0 |
78 | NaN | N.S. | 25545.0 | 24863.0 |
86 | mission | NaN | 41519.0 | 38554.0 |
109 | peterborough | Ont. | NaN | NaN |
131 | NaN | Que. | 29954.0 | 27359.0 |
157 | timmins | NaN | 41145.0 | 41788.0 |
169 | west kelowna | B.C. | NaN | NaN |
You can see that some rows are missing multiple values, while others are just missing one. We can remove rows which have missing data using the function dropna and assign the result back to df so that we will be working only with complete data going forward. Try to modify the code below to drop rows that contain empty values in one of the two population columns. Put another way, we want to keep rows with population values, even if the name or province values are missing. Look at the documentation to figure this out, specifically the argument subset for .dropna().
df = df.dropna()
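For reference, here is one way to use the subset argument so that only rows missing a population value are dropped:
# Keep rows even if Name or Prov/terr is missing; drop rows missing either population
df = df.dropna(subset=['Population, 2021', 'Population, 2016'])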
Great. Now let’s reset to our original data frame and exclude any rows with missing values.
df = pd.read_csv("data/cities.csv")
df = df.dropna()
Instead of dropping (i.e. removing) rows with missing values, we can instead replace them with specific values with fillna. For example, df.fillna(0) would replace all missing data values with a 0. This wouldn’t make sense in our data of Canadian cities, but can be useful in other contexts.
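fillna also accepts a dictionary mapping column names to fill values, which is handy when only some columns should be filled. A minimal sketch:
df.fillna({'Population, 2016': 0}) # Fill missing 2016 populations only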
Similarly, we can programmatically find and replace data in any other column or across our dataset. For example, if we wanted to rename 'Y.T.' to 'Yukon', we would run the replace function like so: df = df.replace('Y.T.', 'Yukon').
If you are using spreadsheet software such as Excel or Google Sheets, there are different ways of approaching this problem. Explore some of the common ways to clean data and handle missing values in the following resources:
Creating new columns
We can add or delete columns as needed. Let’s first add a column which shows the change in population between 2016 and 2021, and then sort to see the cities that experienced the greatest population growth. We can calculate this via a simple subtraction as follows.
"pop_change"] = df["Population, 2021"] - df["Population, 2016"]
df["pop_change", ascending = False).head(5) df.sort_values(
 | Name | Prov/terr | Population, 2021 | Population, 2016 | pop_change
---|---|---|---|---|---
106 | Ottawa | Ont. | 1017449.0 | 934243.0 | 83206.0 |
42 | Edmonton | Alta. | 1010899.0 | 933088.0 | 77811.0 |
19 | Calgary | Alta. | 1306784.0 | 1239220.0 | 67564.0 |
11 | Brampton | Ont. | 656480.0 | 593638.0 | 62842.0 |
158 | Toronto | Ont. | 2794356.0 | 2731571.0 | 62785.0 |
Pandas supports mathematical operations between columns, just like the subtraction we did above. Create a new column called pct_pop_change that computes the percentage change in population between 2016 and 2021, and sort by the cities with the greatest increase.
HINT: the way to compute percent change is 100 * (Y - X) / X.
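If you get stuck, here is one possible solution applying the formula from the hint (the drop in the next cell assumes this column exists):
# Percent change in population between 2016 and 2021
df['pct_pop_change'] = 100 * (df['Population, 2021'] - df['Population, 2016']) / df['Population, 2016']
df.sort_values('pct_pop_change', ascending=False).head(5) # Greatest increases first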
Now let’s clear these new columns out using drop to return to what we had originally.
df = df.drop(columns=['pop_change', 'pct_pop_change'])
In Excel or Google Sheets, you can create a new column by following these steps: type the formula you wish to use into the top-most cell of the new column, then drag that cell down the length of the column.
Concatenating and joining tables
Much of the time, we are working with multiple sets of data which may overlap in key ways. Perhaps we want to include measures of income in cities, or look at voting patterns - this may require us to combine multiple data frames so we can analyze them.
Pandas methods for combining data frames are quite similar to database operations - if you’ve worked with SQL before, this will come very easily to you. There’s an extensive tutorial on this topic, but we’ll focus on simple cases of pd.concat() and pd.merge().
First, we can use pd.concat() to stack DataFrames vertically when new data has the same columns but additional rows (e.g., adding cities from another region). This is like adding new entries to a database table.
= pd.read_csv("./data/new_york_cities.csv")
df_ny_cities = pd.concat([df, df_ny_cities], ignore_index=True)
combined print("Original rows:", len(df), "| After append:", len(combined))
5)) # Show new rows display(combined.tail(
Original rows: 167 | After append: 169
 | Name | Prov/terr | Population, 2021 | Population, 2016
---|---|---|---|---
164 | Wood Buffalo | Alta. | 72326.0 | 71594.0 |
165 | Woodstock | Ont. | 46705.0 | 41098.0 |
166 | Woolwich | Ont. | 26999.0 | 25006.0 |
167 | New York City | N.Y. | 8804190.0 | 8537673.0 |
168 | Buffalo | N.Y. | 278349.0 | 258071.0 |
Next, we can use pd.merge() (or pd.concat(axis=1)) to combine DataFrames side-by-side when they share a key column (e.g., city names). Here, we’ll add a column denoting whether the city is a provincial capital by matching city names.
Let’s first load this data:
df_capitals = pd.read_csv('./data/capitals.csv')
df_capitals
 | Name | Is_Capital
---|---|---
0 | Toronto | True |
1 | Québec | True |
2 | Victoria | True |
3 | Edmonton | True |
4 | Winnipeg | True |
5 | Fredericton | True |
6 | Halifax | True |
7 | Charlottetown | True |
8 | St. John's | True |
9 | Regina | True |
10 | Whitehorse | True |
df_with_capitals = pd.merge(
    df,          # Original data
    df_capitals, # New data
    on="Name",   # The column shared by both tables
    how="left"   # Keep all rows from the "left", i.e. original, table
)

# Set non-capitals to False
df_with_capitals["Is_Capital"] = df_with_capitals["Is_Capital"].astype('boolean').fillna(False)

# Verify: show capitals and counts
print(f"Found {df_with_capitals['Is_Capital'].sum()} capitals:")
df_with_capitals[df_with_capitals["Is_Capital"]]
Found 11 capitals:
 | Name | Prov/terr | Population, 2021 | Population, 2016 | Is_Capital
---|---|---|---|---|---
23 | Charlottetown | P.E.I. | 38809.0 | 36094.0 | True |
38 | Edmonton | Alta. | 1010899.0 | 933088.0 | True |
41 | Fredericton | N.B. | 63116.0 | 58721.0 | True |
49 | Halifax | N.S. | 439819.0 | 403131.0 | True |
108 | Québec | Que. | 549459.0 | 531902.0 | True |
111 | Regina | Sask. | 226404.0 | 215106.0 | True |
139 | St. John's | N.L. | 110525.0 | 108860.0 | True |
147 | Toronto | Ont. | 2794356.0 | 2731571.0 | True |
154 | Victoria | B.C. | 91867.0 | 85792.0 | True |
161 | Whitehorse | Y.T. | 28201.0 | 25085.0 | True |
163 | Winnipeg | Man. | 749607.0 | 705244.0 | True |
To do this using spreadsheet software like Excel or Google Sheets, check out this tutorial, which lays out a few different options. One of the most common approaches for joining two separate datasets is the =VLOOKUP formula.
Summary statistics
A dataset is only as good as its ability to help us understand/describe the world around us. There are some basic methods in pandas we can use to get an idea of what the data says.
The most basic is .describe(), which shows you how much data there is along with a number of summary statistics.
Modify the code below to report summary statistics for cities in Quebec only.
df.describe()
 | Population, 2021 | Population, 2016
---|---|---
count | 1.670000e+02 | 1.670000e+02 |
mean | 1.573169e+05 | 1.487358e+05 |
std | 3.074452e+05 | 2.959960e+05 |
min | 2.570400e+04 | 2.378700e+04 |
25% | 3.770050e+04 | 3.449850e+04 |
50% | 6.414100e+04 | 6.316600e+04 |
75% | 1.348910e+05 | 1.255940e+05 |
max | 2.794356e+06 | 2.731571e+06 |
Instead of picking out and examining subsets of the data one by one, we can use the function .groupby() to examine several subsets concurrently. Given a column name, it will group rows which have the same value; in the example below, that means grouping every row which has the same province name. We can then apply a function (or multiple functions using .agg()) to examine different aspects of the data.
# Group by province and calculate total population
province_pop = df.groupby('Prov/terr')['Population, 2021'].sum()
print("Total population by province:")
province_pop.sort_values(ascending=False)
Total population by province:
Prov/terr
Ont. 11598308.0
Que. 5474109.0
B.C. 3662922.0
Alta. 3192892.0
Man. 800920.0
Sask. 563966.0
N.S. 533513.0
N.B. 240595.0
N.L. 137693.0
P.E.I. 38809.0
Y.T. 28201.0
Name: Population, 2021, dtype: float64
# Multiple aggregation statistics
stats = df.groupby('Prov/terr')['Population, 2021'].agg(['count', 'mean', 'max', 'sum'])
stats
Prov/terr | count | mean | max | sum
---|---|---|---|---
Alta. | 17 | 187817.176471 | 1306784.0 | 3192892.0 |
B.C. | 29 | 126307.655172 | 662248.0 | 3662922.0 |
Man. | 2 | 400460.000000 | 749607.0 | 800920.0 |
N.B. | 4 | 60148.750000 | 79470.0 | 240595.0 |
N.L. | 2 | 68846.500000 | 110525.0 | 137693.0 |
N.S. | 2 | 266756.500000 | 439819.0 | 533513.0 |
Ont. | 64 | 181223.562500 | 2794356.0 | 11598308.0 |
P.E.I. | 1 | 38809.000000 | 38809.0 | 38809.0 |
Que. | 41 | 133514.853659 | 1762949.0 | 5474109.0 |
Sask. | 4 | 140991.500000 | 266141.0 | 563966.0 |
Y.T. | 1 | 28201.000000 | 28201.0 | 28201.0 |
Below, we’ve added a column which shows the percent growth for each city. Use .groupby('Prov/terr') and the function .median() (just as we used .sum() above) to observe the median growth rate per province or territory. Make sure to use sort_values() and set ascending to False.
df['Percent_Growth'] = 100 * (df['Population, 2021'] - df['Population, 2016']) / df['Population, 2016']
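One possible solution chains .groupby(), .median(), and sort_values() together:
# Median percent growth per province/territory, largest first
df.groupby('Prov/terr')['Percent_Growth'].median().sort_values(ascending=False)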
In spreadsheet software like Excel or Google Sheets, you can calculate descriptive statistics by applying individual formulas to the columns you wish to analyze, such as =COUNT (count), =MIN (minimum), =MAX (maximum), =AVERAGE (mean), =MEDIAN (median), =PERCENTILE (percentiles, which can be used for quartiles, quantiles, or any percentile you wish), and =STDEVA (standard deviation). If it’s available in your Excel subscription, you can also explore Excel’s Analysis ToolPak add-in, which allows for more efficient and advanced statistical analysis.
Cross tabulations and pivot tables
A cross tabulation, also called a frequency table or a contingency table, is a table that summarizes two categorical variables by displaying the number of occurrences in each pair of categories.
Here’s an example. Let’s count the number of cities in each province by city size.
We will need two tools to do this:
- cut, which bins continuous numbers (like population) into categories (e.g., “Small”/“Medium”/“Large”), turning numbers into meaningful groups.
- crosstab, which counts how often these groups appear together (like how many “Medium” cities exist per province), revealing patterns that raw numbers hide.
# Create a size category column
df['Size'] = pd.cut(df['Population, 2021'],
                    bins=[0, 50000, 200000, float('inf')],
                    labels=['Small', 'Medium', 'Large'])

# Cross-tab: Province vs. Size
size_table = pd.crosstab(df['Prov/terr'], df['Size'], margins=True)
size_table
Recall that we just created the column 'Percent_Growth' as well. Use these two functions to create different bins for different levels of growth and cross tabulate them.
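Here is one possible approach; the bin edges below are arbitrary choices, so feel free to pick different ones:
# Bin percent growth into three categories (cut points are arbitrary)
df['Growth'] = pd.cut(df['Percent_Growth'],
                      bins=[-float('inf'), 0, 5, float('inf')],
                      labels=['Declining', 'Slow growth', 'Fast growth'])

# Cross-tabulate province against growth category
pd.crosstab(df['Prov/terr'], df['Growth'])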
If you’ve worked with Excel or Google Sheets, this is very similar to doing a Pivot Table. Here are a couple of tutorials for spreadsheet software:
Pivot tables are able to summarize data across several categories, and for different types of summaries (e.g. sum, mean, median, etc.).
The pivot_table function in pandas also aggregates data, and has more options than the crosstab function. Both are quite similar though. Here’s an example where we are using pivot_table to sum the population in each province by the 3 size groups.
Try to edit this to compute the mean or median city Percent_Growth.
pd.pivot_table(data=df, values=['Population, 2021'], index=['Prov/terr'], columns=['Size'], aggfunc='sum', observed=True)
Population, 2021, aggregated by Size:
Prov/terr | Small | Medium | Large
---|---|---|---
Alta. | 234664.0 | 640545.0 | 2317683.0 |
B.C. | 330537.0 | 1642753.0 | 1689632.0 |
Man. | NaN | 51313.0 | 749607.0 |
N.B. | 28114.0 | 212481.0 | NaN |
N.L. | 27168.0 | 110525.0 | NaN |
N.S. | NaN | 93694.0 | 439819.0 |
Ont. | 930812.0 | 2925641.0 | 7741855.0 |
P.E.I. | 38809.0 | NaN | NaN |
Que. | 806267.0 | 1371544.0 | 3296298.0 |
Sask. | 71421.0 | NaN | 492545.0 |
Y.T. | 28201.0 | NaN | NaN |
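For reference, one way to adapt the call above to report median percent growth instead (assuming you created the Percent_Growth column earlier):
# Median percent growth per province, split by city size
pd.pivot_table(data=df, values=['Percent_Growth'], index=['Prov/terr'],
               columns=['Size'], aggfunc='median', observed=True)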
There are often multiple ways to achieve similar results. In the previous section we looked at the groupby function to summarize data by one column, while above we used crosstab or pivot_table. However, we could also use the groupby function for this purpose. Check out the example below.
You should notice that it has created a long-table format rather than a wide-table format. Both formats can be useful. Wide-table formats are better for viewing data with only 2 categories, while long-table formats are often used for inputting data into modelling or visualization libraries.
df.groupby(['Prov/terr', 'Size'], observed=False)['Population, 2021'].agg(['sum'])
Prov/terr | Size | sum
---|---|---
Alta. | Small | 234664.0
 | Medium | 640545.0
 | Large | 2317683.0
B.C. | Small | 330537.0
 | Medium | 1642753.0
 | Large | 1689632.0
Man. | Small | 0.0
 | Medium | 51313.0
 | Large | 749607.0
N.B. | Small | 28114.0
 | Medium | 212481.0
 | Large | 0.0
N.L. | Small | 27168.0
 | Medium | 110525.0
 | Large | 0.0
N.S. | Small | 0.0
 | Medium | 93694.0
 | Large | 439819.0
Ont. | Small | 930812.0
 | Medium | 2925641.0
 | Large | 7741855.0
P.E.I. | Small | 38809.0
 | Medium | 0.0
 | Large | 0.0
Que. | Small | 806267.0
 | Medium | 1371544.0
 | Large | 3296298.0
Sask. | Small | 71421.0
 | Medium | 0.0
 | Large | 492545.0
Y.T. | Small | 28201.0
 | Medium | 0.0
 | Large | 0.0
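As an aside, a long-format groupby result like this can be reshaped back into the wide format with .unstack(). A minimal sketch:
# Same aggregation as above, with the 'Size' level pivoted into columns
df.groupby(['Prov/terr', 'Size'], observed=False)['Population, 2021'].sum().unstack('Size')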