Reorganizing Data in DataFrames

Courses

Intro to Python for Scientists

Tabular Data

Authors

Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Open in JupyterLite Download Materials

Setup

Import Libraries

import numpy as np
import pandas as pd

Section 1: Concatenation / Appends

The pd.concat() function and DataFrame.append() method take DataFrames with identical columns and make a DataFrame that is taller than either of them by stacking them on top of each other.

For example, it can turn this df1 DataFrame:

Day	Weather
Monday	Sunny
Tuesday	Rainy

and this df2 DataFrame:

Day	Weather
Wednesday	Sunny
Thursday	Rainy

into this:

Day	Weather
Monday	Sunny
Tuesday	Rainy
Wednesday	Sunny
Thursday	Rainy

with one line of code:

pd.concat([df1, df2])

Note: If you’d like pandas to ignore the index of the dataframes when appending them, the following option is helpful:

pd.concat([df1, df2], ignore_index=True)

Note: With the pd.concat() function, you can concatenate as many dataframes in one step as you want!

Code	Description
`pd.concat([df1, df2])`	Concatenate DataFrames vertically (stack them on top of each other).
`pd.concat([df1, df2], ignore_index=True)`	Concatenate DataFrames and reset the index to 0, 1, 2, …
`pd.concat([df1, df2, df3])`	Concatenate multiple DataFrames in a single operation.

Run the cells below to create three DataFrames that we’ll use to practice concatenating DataFrames with the pd.concat function.

Exercises

df1 = pd.DataFrame({'Name': ['Jackson', 'Josh', 'Jenny'], 'Age': [16, 19, 17]})
df1

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17

df2 = pd.DataFrame({'Name': ['Jess', 'Jon', 'Joe'], 'Age': [21, 16, 23]}, index=[3, 4, 5])
df2

	Name	Age
3	Jess	21
4	Jon	16
5	Joe	23

df3 = pd.DataFrame({'Name': ['Jill', 'Josephine', 'Jack'], 'Age': [5, 10, 61]})
df3

	Name	Age
0	Jill	5
1	Josephine	10
2	Jack	61

Exercise: Concatenate df1 and df2 together!

Solution

pd.concat([df1, df2])

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17
3	Jess	21
4	Jon	16
5	Joe	23

Exercise: Concatenate df2 and df3 together!

Solution

pd.concat([df2, df3])

	Name	Age
3	Jess	21
4	Jon	16
5	Joe	23
0	Jill	5
1	Josephine	10
2	Jack	61

Exercise: Concatenate all three dataframes in a single line.

Solution

pd.concat([df1, df2, df3])

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17
3	Jess	21
4	Jon	16
5	Joe	23
0	Jill	5
1	Josephine	10
2	Jack	61

Exercise: The index in these datasets is unlabelled, indicating that they potentially don’t contain useful data. Concatenate them all together so that the index of the final dataframe is simply 0-8.

Solution

pd.concat([df1, df2, df3], ignore_index=True)

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17
3	Jess	21
4	Jon	16
5	Joe	23
6	Jill	5
7	Josephine	10
8	Jack	61

Section 2: Merge / Joins

The pd.merge() function and DataFrame.join() method take two DataFrames and make them wider by matching rows with the same values on a specified column.

For example, it can turn this df1 DataFrame:

Day	Weather
Monday	Sunny
Tuesday	Rainy

and this df2 DataFrame:

Day	Temperature
Tuesday	12
Monday	18

into this:

Day	Weather	Temperature
Monday	Sunny	18
Tuesday	Rainy	12

with one line of code:

df_merged = pd.merge(left=df1, right=df2, left_on="Day", right_on="Day")

Just specify which columns should be matched up with each other, and it will search for the matching values automatically! If you want it to use the index, you can alternatively supply the option left_index=True and/or right_index=True.

Code	Description
`pd.merge(left=df1, right=df2, left_on='col1', right_on='col2')`	Merge two DataFrames based on matching column values.
`pd.merge(left=df1, right=df2, left_index=True, right_index=True)`	Merge two DataFrames using their indices.
`pd.merge(left=df1, right=df2, on='col', how='inner')`	Inner join: keep only rows that match in both DataFrames.
`pd.merge(left=df1, right=df2, on='col', how='outer')`	Outer join: keep all rows from both DataFrames, fill with NaN where no match.
`pd.merge(left=df1, right=df2, on='col', how='left')`	Left join: keep all rows from left DataFrame, fill with NaN where no match in right.
`pd.merge(left=df1, right=df2, on='col', how='right')`	Right join: keep all rows from right DataFrame, fill with NaN where no match in left.

Run the cells below to create three dataframes that we’ll use to practice merging with pd.merge().

Exercises

df1 = pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]})
df1

	Name	Age
0	Paul	16
1	Arash	19
2	Jenny	17

Dataframe 2:

df2 = pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]})
df2

	Name	Weight
0	Arash	32
1	Paul	15
2	Sara	37

Dataframe 3:

df3 = pd.DataFrame({'Name': ['Amy', 'Paul', 'Sara'], 'Height': [170, 190, 143]})
df3

	Name	Height
0	Amy	170
1	Paul	190
2	Sara	143

Exercise: Merge the first two dataframes together. Who do we know both the age and weight of?

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')

	Name	Age	Weight
0	Paul	16	15
1	Arash	19	32

Exercise: Try merging all 3 by merging twice. Who do we know everything about?

Solution

df_merge1 = pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')
df = pd.merge(left=df_merge1, right = df3, left_on = 'Name', right_on = 'Name')
df

	Name	Age	Weight	Height
0	Paul	16	15	190

Exercise: Note that the Names that weren’t present in both dataframes dropped out of the final result. If you’d like to keep them and have NaNs appear, you can change the how parameter in the pd.merge() function. Let’s try out a few options by merging dataframes 1 and 2. Use how="outer".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "outer")

	Name	Age	Weight
0	Arash	19.0	32.0
1	Jenny	17.0	NaN
2	Paul	16.0	15.0
3	Sara	NaN	37.0

Exercise: Use how="left".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "left")

	Name	Age	Weight
0	Paul	16	15.0
1	Arash	19	32.0
2	Jenny	17	NaN

Exercise: Use how="right".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "right")

	Name	Age	Weight
0	Arash	19.0	32
1	Paul	16.0	15
2	Sara	NaN	37

Exercise: Use how="inner".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "inner")

	Name	Age	Weight
0	Paul	16	15
1	Arash	19	32

Recognizing that multiple inner joins can result in high data attrition, what policies would you put in your future data analyses to both prevent data loss and keep data easy to analyze?

Section 3: Melts

The pd.melt() function and DataFrame.melt() method take a single dataframe and make it taller by taking data stored in column names and putting it into the rows along with the rest of the data, adding extra metadata in the process.

For example, it can turn this df DataFrame:

Month	Year	Monday	Tuesday	Wednesday
January	2021	0	-2	-1
February	2021	2	4	-2

into this:

Month	Year	Weekday	Temperature
January	2021	Monday	0
January	2021	Tuesday	-2
January	2021	Wednesday	-1
February	2021	Monday	2
February	2021	Tuesday	4
February	2021	Wednesday	-2

with one line of code:

pd.melt(
    df, 
    id_vars=['Month', 'Year'],  # The columns that should stay the same
    value_vars=['Monday', 'Tuesday', 'Wednesday'],   # The columns that should melt
    var_name='Weekday',  # The new Column that will represent the melted column name's variable
    value_name='Temperature'  # The new Column that the data represents
)

Note: Melting a dataframe is also called “tidying” data, making a “long” dataframe from a “wide” dataframe, or building a design matrix.

Code	Description
`pd.melt(df, id_vars=['col1'], value_vars=['col2', 'col3'])`	Melt a DataFrame by specifying which columns to keep (`id_vars`) and which to melt (`value_vars`).
`pd.melt(df, id_vars=['col1'], var_name='variable', value_name='value')`	Melt a DataFrame and specify custom names for the resulting variable and value columns.
`df.reset_index(drop=True)`	Reset the DataFrame index to 0, 1, 2, … and drop the old index.

Let’s practice tidying dataframes with the pd.melt() function. Run the cell below to create a dataframe df containing data from the world bank dataset.

Exercises

df = (
    pd.read_csv('https://raw.githubusercontent.com/nickdelgrosso/CodeTeachingMaterials/main/datasets/worldbankdata.csv')
    .get(['Country Name', 'Country Code', '1960', '1970', '1980', '1990', '2000'])
    .sample(10)
    .reset_index(drop=True)
)
df.head()

	Country Name	Country Code	1960	1970	1980	1990	2000
0	Rwanda	RWA	8.187	8.231	8.448	7.271	5.897
1	Other small states	OSS	5.841	5.710	5.487	5.012	4.236
2	Pakistan	PAK	6.600	6.601	6.535	6.024	4.474
3	Russian Federation	RUS	2.520	1.990	1.890	1.890	1.210
4	Palau	PLW	NaN	NaN	NaN	2.757	1.535

Exercise: Melt this dataset so it has four columns: “Country Name”, “Country Code”, “Year”, and “Fertility Rate”

Solution

df2 = pd.melt(df,
        id_vars = ['Country Name', 'Country Code'],
        value_vars = ['1960', '1970', '1980', '1990', '2000'],
        var_name='Decade',
        value_name='Birth Rate',
        )

df2

	Country Name	Country Code	Decade	Birth Rate
0	Rwanda	RWA	1960	8.187
1	Other small states	OSS	1960	5.841
2	Pakistan	PAK	1960	6.600
3	Russian Federation	RUS	1960	2.520
4	Palau	PLW	1960	NaN
5	Mozambique	MOZ	1960	6.600
6	New Caledonia	NCL	1960	6.278
7	American Samoa	ASM	1960	NaN
8	Czech Republic	CZE	1960	2.075
9	Liechtenstein	LIE	1960	NaN
10	Rwanda	RWA	1970	8.231
11	Other small states	OSS	1970	5.710
12	Pakistan	PAK	1970	6.601
13	Russian Federation	RUS	1970	1.990
14	Palau	PLW	1970	NaN
15	Mozambique	MOZ	1970	6.591
16	New Caledonia	NCL	1970	4.300
17	American Samoa	ASM	1970	NaN
18	Czech Republic	CZE	1970	1.929
19	Liechtenstein	LIE	1970	NaN
20	Rwanda	RWA	1980	8.448
21	Other small states	OSS	1980	5.487
22	Pakistan	PAK	1980	6.535
23	Russian Federation	RUS	1980	1.890
24	Palau	PLW	1980	NaN
25	Mozambique	MOZ	1980	6.490
26	New Caledonia	NCL	1980	3.256
27	American Samoa	ASM	1980	NaN
28	Czech Republic	CZE	1980	2.070
29	Liechtenstein	LIE	1980	NaN
30	Rwanda	RWA	1990	7.271
31	Other small states	OSS	1990	5.012
32	Pakistan	PAK	1990	6.024
33	Russian Federation	RUS	1990	1.890
34	Palau	PLW	1990	2.757
35	Mozambique	MOZ	1990	6.237
36	New Caledonia	NCL	1990	3.190
37	American Samoa	ASM	1990	NaN
38	Czech Republic	CZE	1990	1.893
39	Liechtenstein	LIE	1990	NaN
40	Rwanda	RWA	2000	5.897
41	Other small states	OSS	2000	4.236
42	Pakistan	PAK	2000	4.474
43	Russian Federation	RUS	2000	1.210
44	Palau	PLW	2000	1.535
45	Mozambique	MOZ	2000	5.782
46	New Caledonia	NCL	2000	2.630
47	American Samoa	ASM	2000	NaN
48	Czech Republic	CZE	2000	1.140
49	Liechtenstein	LIE	2000	1.570