Reorganizing Data in DataFrames
Authors
Setup
Import Libraries
import numpy as np
import pandas as pdSection 1: Concatenation / Appends
The pd.concat() function and DataFrame.append() method take DataFrames with identical columns and make a DataFrame that is taller than either of them by stacking them on top of each other.
For example, it can turn this df1 DataFrame:
| Day | Weather |
|---|---|
| Monday | Sunny |
| Tuesday | Rainy |
and this df2 DataFrame:
| Day | Weather |
|---|---|
| Wednesday | Sunny |
| Thursday | Rainy |
into this:
| Day | Weather |
|---|---|
| Monday | Sunny |
| Tuesday | Rainy |
| Wednesday | Sunny |
| Thursday | Rainy |
with one line of code:
pd.concat([df1, df2])Note: If you’d like pandas to ignore the index of the dataframes when appending them, the following option is helpful:
pd.concat([df1, df2], ignore_index=True)Note: With the pd.concat() function, you can concatenate as many dataframes in one step as you want!
| Code | Description |
|---|---|
pd.concat([df1, df2]) |
Concatenate DataFrames vertically (stack them on top of each other). |
pd.concat([df1, df2], ignore_index=True) |
Concatenate DataFrames and reset the index to 0, 1, 2, … |
pd.concat([df1, df2, df3]) |
Concatenate multiple DataFrames in a single operation. |
Run the cells below to create three DataFrames that we’ll use to practice concatenating DataFrames with the pd.concat function.
Exercises
df1 = pd.DataFrame({'Name': ['Jackson', 'Josh', 'Jenny'], 'Age': [16, 19, 17]})
df1| Name | Age | |
|---|---|---|
| 0 | Jackson | 16 |
| 1 | Josh | 19 |
| 2 | Jenny | 17 |
df2 = pd.DataFrame({'Name': ['Jess', 'Jon', 'Joe'], 'Age': [21, 16, 23]}, index=[3, 4, 5])
df2| Name | Age | |
|---|---|---|
| 3 | Jess | 21 |
| 4 | Jon | 16 |
| 5 | Joe | 23 |
df3 = pd.DataFrame({'Name': ['Jill', 'Josephine', 'Jack'], 'Age': [5, 10, 61]})
df3| Name | Age | |
|---|---|---|
| 0 | Jill | 5 |
| 1 | Josephine | 10 |
| 2 | Jack | 61 |
Exercise: Concatenate df1 and df2 together!
Solution
pd.concat([df1, df2])| Name | Age | |
|---|---|---|
| 0 | Jackson | 16 |
| 1 | Josh | 19 |
| 2 | Jenny | 17 |
| 3 | Jess | 21 |
| 4 | Jon | 16 |
| 5 | Joe | 23 |
Exercise: Concatenate df2 and df3 together!
Solution
pd.concat([df2, df3])| Name | Age | |
|---|---|---|
| 3 | Jess | 21 |
| 4 | Jon | 16 |
| 5 | Joe | 23 |
| 0 | Jill | 5 |
| 1 | Josephine | 10 |
| 2 | Jack | 61 |
Exercise: Concatenate all three dataframes in a single line.
Solution
pd.concat([df1, df2, df3])| Name | Age | |
|---|---|---|
| 0 | Jackson | 16 |
| 1 | Josh | 19 |
| 2 | Jenny | 17 |
| 3 | Jess | 21 |
| 4 | Jon | 16 |
| 5 | Joe | 23 |
| 0 | Jill | 5 |
| 1 | Josephine | 10 |
| 2 | Jack | 61 |
Exercise: The index in these datasets is unlabelled, indicating that they potentially don’t contain useful data. Concatenate them all together so that the index of the final dataframe is simply 0-8.
Solution
pd.concat([df1, df2, df3], ignore_index=True)| Name | Age | |
|---|---|---|
| 0 | Jackson | 16 |
| 1 | Josh | 19 |
| 2 | Jenny | 17 |
| 3 | Jess | 21 |
| 4 | Jon | 16 |
| 5 | Joe | 23 |
| 6 | Jill | 5 |
| 7 | Josephine | 10 |
| 8 | Jack | 61 |
Section 2: Merge / Joins
The pd.merge() function and DataFrame.join() method take two DataFrames and make them wider by matching rows with the same values on a specified column.
For example, it can turn this df1 DataFrame:
| Day | Weather |
|---|---|
| Monday | Sunny |
| Tuesday | Rainy |
and this df2 DataFrame:
| Day | Temperature |
|---|---|
| Tuesday | 12 |
| Monday | 18 |
into this:
| Day | Weather | Temperature |
|---|---|---|
| Monday | Sunny | 18 |
| Tuesday | Rainy | 12 |
with one line of code:
df_merged = pd.merge(left=df1, right=df2, left_on="Day", right_on="Day")Just specify which columns should be matched up with each other, and it will search for the matching values automatically! If you want it to use the index, you can alternatively supply the option left_index=True and/or right_index=True.
| Code | Description |
|---|---|
pd.merge(left=df1, right=df2, left_on='col1', right_on='col2') |
Merge two DataFrames based on matching column values. |
pd.merge(left=df1, right=df2, left_index=True, right_index=True) |
Merge two DataFrames using their indices. |
pd.merge(left=df1, right=df2, on='col', how='inner') |
Inner join: keep only rows that match in both DataFrames. |
pd.merge(left=df1, right=df2, on='col', how='outer') |
Outer join: keep all rows from both DataFrames, fill with NaN where no match. |
pd.merge(left=df1, right=df2, on='col', how='left') |
Left join: keep all rows from left DataFrame, fill with NaN where no match in right. |
pd.merge(left=df1, right=df2, on='col', how='right') |
Right join: keep all rows from right DataFrame, fill with NaN where no match in left. |
Run the cells below to create three dataframes that we’ll use to practice merging with pd.merge().
Exercises
df1 = pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]})
df1| Name | Age | |
|---|---|---|
| 0 | Paul | 16 |
| 1 | Arash | 19 |
| 2 | Jenny | 17 |
Dataframe 2:
df2 = pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]})
df2| Name | Weight | |
|---|---|---|
| 0 | Arash | 32 |
| 1 | Paul | 15 |
| 2 | Sara | 37 |
Dataframe 3:
df3 = pd.DataFrame({'Name': ['Amy', 'Paul', 'Sara'], 'Height': [170, 190, 143]})
df3| Name | Height | |
|---|---|---|
| 0 | Amy | 170 |
| 1 | Paul | 190 |
| 2 | Sara | 143 |
Exercise: Merge the first two dataframes together. Who do we know both the age and weight of?
Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')| Name | Age | Weight | |
|---|---|---|---|
| 0 | Paul | 16 | 15 |
| 1 | Arash | 19 | 32 |
Exercise: Try merging all 3 by merging twice. Who do we know everything about?
Solution
df_merge1 = pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')
df = pd.merge(left=df_merge1, right = df3, left_on = 'Name', right_on = 'Name')
df| Name | Age | Weight | Height | |
|---|---|---|---|---|
| 0 | Paul | 16 | 15 | 190 |
Exercise: Note that the Names that weren’t present in both dataframes dropped out of the final result. If you’d like to keep them and have NaNs appear, you can change the how parameter in the pd.merge() function. Let’s try out a few options by merging dataframes 1 and 2. Use how="outer".
Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "outer")| Name | Age | Weight | |
|---|---|---|---|
| 0 | Arash | 19.0 | 32.0 |
| 1 | Jenny | 17.0 | NaN |
| 2 | Paul | 16.0 | 15.0 |
| 3 | Sara | NaN | 37.0 |
Exercise: Use how="left".
Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "left")| Name | Age | Weight | |
|---|---|---|---|
| 0 | Paul | 16 | 15.0 |
| 1 | Arash | 19 | 32.0 |
| 2 | Jenny | 17 | NaN |
Exercise: Use how="right".
Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "right")| Name | Age | Weight | |
|---|---|---|---|
| 0 | Arash | 19.0 | 32 |
| 1 | Paul | 16.0 | 15 |
| 2 | Sara | NaN | 37 |
Exercise: Use how="inner".
Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "inner")| Name | Age | Weight | |
|---|---|---|---|
| 0 | Paul | 16 | 15 |
| 1 | Arash | 19 | 32 |
Recognizing that multiple inner joins can result in high data attrition, what policies would you put in your future data analyses to both prevent data loss and keep data easy to analyze?
Section 3: Melts
The pd.melt() function and DataFrame.melt() method take a single dataframe and make it taller by taking data stored in column names and putting it into the rows along with the rest of the data, adding extra metadata in the process.
For example, it can turn this df DataFrame:
| Month | Year | Monday | Tuesday | Wednesday |
|---|---|---|---|---|
| January | 2021 | 0 | -2 | -1 |
| February | 2021 | 2 | 4 | -2 |
into this:
| Month | Year | Weekday | Temperature |
|---|---|---|---|
| January | 2021 | Monday | 0 |
| January | 2021 | Tuesday | -2 |
| January | 2021 | Wednesday | -1 |
| February | 2021 | Monday | 2 |
| February | 2021 | Tuesday | 4 |
| February | 2021 | Wednesday | -2 |
with one line of code:
pd.melt(
df,
id_vars=['Month', 'Year'], # The columns that should stay the same
value_vars=['Monday', 'Tuesday', 'Wednesday'], # The columns that should melt
var_name='Weekday', # The new Column that will represent the melted column name's variable
value_name='Temperature' # The new Column that the data represents
)Note: Melting a dataframe is also called “tidying” data, making a “long” dataframe from a “wide” dataframe, or building a design matrix.
| Code | Description |
|---|---|
pd.melt(df, id_vars=['col1'], value_vars=['col2', 'col3']) |
Melt a DataFrame by specifying which columns to keep (id_vars) and which to melt (value_vars). |
pd.melt(df, id_vars=['col1'], var_name='variable', value_name='value') |
Melt a DataFrame and specify custom names for the resulting variable and value columns. |
df.reset_index(drop=True) |
Reset the DataFrame index to 0, 1, 2, … and drop the old index. |
Let’s practice tidying dataframes with the pd.melt() function. Run the cell below to create a dataframe df containing data from the world bank dataset.
Exercises
df = (
pd.read_csv('https://raw.githubusercontent.com/nickdelgrosso/CodeTeachingMaterials/main/datasets/worldbankdata.csv')
.get(['Country Name', 'Country Code', '1960', '1970', '1980', '1990', '2000'])
.sample(10)
.reset_index(drop=True)
)
df.head()| Country Name | Country Code | 1960 | 1970 | 1980 | 1990 | 2000 | |
|---|---|---|---|---|---|---|---|
| 0 | Rwanda | RWA | 8.187 | 8.231 | 8.448 | 7.271 | 5.897 |
| 1 | Other small states | OSS | 5.841 | 5.710 | 5.487 | 5.012 | 4.236 |
| 2 | Pakistan | PAK | 6.600 | 6.601 | 6.535 | 6.024 | 4.474 |
| 3 | Russian Federation | RUS | 2.520 | 1.990 | 1.890 | 1.890 | 1.210 |
| 4 | Palau | PLW | NaN | NaN | NaN | 2.757 | 1.535 |
Exercise: Melt this dataset so it has four columns: “Country Name”, “Country Code”, “Year”, and “Fertility Rate”
Solution
df2 = pd.melt(df,
id_vars = ['Country Name', 'Country Code'],
value_vars = ['1960', '1970', '1980', '1990', '2000'],
var_name='Decade',
value_name='Birth Rate',
)df2| Country Name | Country Code | Decade | Birth Rate | |
|---|---|---|---|---|
| 0 | Rwanda | RWA | 1960 | 8.187 |
| 1 | Other small states | OSS | 1960 | 5.841 |
| 2 | Pakistan | PAK | 1960 | 6.600 |
| 3 | Russian Federation | RUS | 1960 | 2.520 |
| 4 | Palau | PLW | 1960 | NaN |
| 5 | Mozambique | MOZ | 1960 | 6.600 |
| 6 | New Caledonia | NCL | 1960 | 6.278 |
| 7 | American Samoa | ASM | 1960 | NaN |
| 8 | Czech Republic | CZE | 1960 | 2.075 |
| 9 | Liechtenstein | LIE | 1960 | NaN |
| 10 | Rwanda | RWA | 1970 | 8.231 |
| 11 | Other small states | OSS | 1970 | 5.710 |
| 12 | Pakistan | PAK | 1970 | 6.601 |
| 13 | Russian Federation | RUS | 1970 | 1.990 |
| 14 | Palau | PLW | 1970 | NaN |
| 15 | Mozambique | MOZ | 1970 | 6.591 |
| 16 | New Caledonia | NCL | 1970 | 4.300 |
| 17 | American Samoa | ASM | 1970 | NaN |
| 18 | Czech Republic | CZE | 1970 | 1.929 |
| 19 | Liechtenstein | LIE | 1970 | NaN |
| 20 | Rwanda | RWA | 1980 | 8.448 |
| 21 | Other small states | OSS | 1980 | 5.487 |
| 22 | Pakistan | PAK | 1980 | 6.535 |
| 23 | Russian Federation | RUS | 1980 | 1.890 |
| 24 | Palau | PLW | 1980 | NaN |
| 25 | Mozambique | MOZ | 1980 | 6.490 |
| 26 | New Caledonia | NCL | 1980 | 3.256 |
| 27 | American Samoa | ASM | 1980 | NaN |
| 28 | Czech Republic | CZE | 1980 | 2.070 |
| 29 | Liechtenstein | LIE | 1980 | NaN |
| 30 | Rwanda | RWA | 1990 | 7.271 |
| 31 | Other small states | OSS | 1990 | 5.012 |
| 32 | Pakistan | PAK | 1990 | 6.024 |
| 33 | Russian Federation | RUS | 1990 | 1.890 |
| 34 | Palau | PLW | 1990 | 2.757 |
| 35 | Mozambique | MOZ | 1990 | 6.237 |
| 36 | New Caledonia | NCL | 1990 | 3.190 |
| 37 | American Samoa | ASM | 1990 | NaN |
| 38 | Czech Republic | CZE | 1990 | 1.893 |
| 39 | Liechtenstein | LIE | 1990 | NaN |
| 40 | Rwanda | RWA | 2000 | 5.897 |
| 41 | Other small states | OSS | 2000 | 4.236 |
| 42 | Pakistan | PAK | 2000 | 4.474 |
| 43 | Russian Federation | RUS | 2000 | 1.210 |
| 44 | Palau | PLW | 2000 | 1.535 |
| 45 | Mozambique | MOZ | 2000 | 5.782 |
| 46 | New Caledonia | NCL | 2000 | 2.630 |
| 47 | American Samoa | ASM | 2000 | NaN |
| 48 | Czech Republic | CZE | 2000 | 1.140 |
| 49 | Liechtenstein | LIE | 2000 | 1.570 |