Reorganizing Data in DataFrames

Reorganizing Data in DataFrames

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Import Libraries

import numpy as np
import pandas as pd

Section 1: Concatenation / Appends

The pd.concat() function and DataFrame.append() method take DataFrames with identical columns and make a DataFrame that is taller than either of them by stacking them on top of each other.

For example, it can turn this df1 DataFrame:

Day Weather
Monday Sunny
Tuesday Rainy

and this df2 DataFrame:

Day Weather
Wednesday Sunny
Thursday Rainy

into this:

Day Weather
Monday Sunny
Tuesday Rainy
Wednesday Sunny
Thursday Rainy

with one line of code:

pd.concat([df1, df2])

Note: If you’d like pandas to ignore the index of the dataframes when appending them, the following option is helpful:

pd.concat([df1, df2], ignore_index=True)

Note: With the pd.concat() function, you can concatenate as many dataframes in one step as you want!

Code Description
pd.concat([df1, df2]) Concatenate DataFrames vertically (stack them on top of each other).
pd.concat([df1, df2], ignore_index=True) Concatenate DataFrames and reset the index to 0, 1, 2, …
pd.concat([df1, df2, df3]) Concatenate multiple DataFrames in a single operation.

Run the cells below to create three DataFrames that we’ll use to practice concatenating DataFrames with the pd.concat function.

Exercises

df1 = pd.DataFrame({'Name': ['Jackson', 'Josh', 'Jenny'], 'Age': [16, 19, 17]})
df1

Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
df2 = pd.DataFrame({'Name': ['Jess', 'Jon', 'Joe'], 'Age': [21, 16, 23]}, index=[3, 4, 5])
df2

Name Age
3 Jess 21
4 Jon 16
5 Joe 23
df3 = pd.DataFrame({'Name': ['Jill', 'Josephine', 'Jack'], 'Age': [5, 10, 61]})
df3

Name Age
0 Jill 5
1 Josephine 10
2 Jack 61

Exercise: Concatenate df1 and df2 together!

Solution
pd.concat([df1, df2])

Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
3 Jess 21
4 Jon 16
5 Joe 23

Exercise: Concatenate df2 and df3 together!

Solution
pd.concat([df2, df3])

Name Age
3 Jess 21
4 Jon 16
5 Joe 23
0 Jill 5
1 Josephine 10
2 Jack 61

Exercise: Concatenate all three dataframes in a single line.

Solution
pd.concat([df1, df2, df3])

Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
3 Jess 21
4 Jon 16
5 Joe 23
0 Jill 5
1 Josephine 10
2 Jack 61

Exercise: The index in these datasets is unlabelled, indicating that they potentially don’t contain useful data. Concatenate them all together so that the index of the final dataframe is simply 0-8.

Solution
pd.concat([df1, df2, df3], ignore_index=True)

Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
3 Jess 21
4 Jon 16
5 Joe 23
6 Jill 5
7 Josephine 10
8 Jack 61

Section 2: Merge / Joins

The pd.merge() function and DataFrame.join() method take two DataFrames and make them wider by matching rows with the same values on a specified column.

For example, it can turn this df1 DataFrame:

Day Weather
Monday Sunny
Tuesday Rainy

and this df2 DataFrame:

Day Temperature
Tuesday 12
Monday 18

into this:

Day Weather Temperature
Monday Sunny 18
Tuesday Rainy 12

with one line of code:

df_merged = pd.merge(left=df1, right=df2, left_on="Day", right_on="Day")

Just specify which columns should be matched up with each other, and it will search for the matching values automatically! If you want it to use the index, you can alternatively supply the option left_index=True and/or right_index=True.

Code Description
pd.merge(left=df1, right=df2, left_on='col1', right_on='col2') Merge two DataFrames based on matching column values.
pd.merge(left=df1, right=df2, left_index=True, right_index=True) Merge two DataFrames using their indices.
pd.merge(left=df1, right=df2, on='col', how='inner') Inner join: keep only rows that match in both DataFrames.
pd.merge(left=df1, right=df2, on='col', how='outer') Outer join: keep all rows from both DataFrames, fill with NaN where no match.
pd.merge(left=df1, right=df2, on='col', how='left') Left join: keep all rows from left DataFrame, fill with NaN where no match in right.
pd.merge(left=df1, right=df2, on='col', how='right') Right join: keep all rows from right DataFrame, fill with NaN where no match in left.

Run the cells below to create three dataframes that we’ll use to practice merging with pd.merge().

Exercises

df1 = pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]})
df1

Name Age
0 Paul 16
1 Arash 19
2 Jenny 17

Dataframe 2:

df2 = pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]})
df2

Name Weight
0 Arash 32
1 Paul 15
2 Sara 37

Dataframe 3:

df3 = pd.DataFrame({'Name': ['Amy', 'Paul', 'Sara'], 'Height': [170, 190, 143]})
df3

Name Height
0 Amy 170
1 Paul 190
2 Sara 143

Exercise: Merge the first two dataframes together. Who do we know both the age and weight of?

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')

Name Age Weight
0 Paul 16 15
1 Arash 19 32

Exercise: Try merging all 3 by merging twice. Who do we know everything about?

Solution
df_merge1 = pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')
df = pd.merge(left=df_merge1, right = df3, left_on = 'Name', right_on = 'Name')
df

Name Age Weight Height
0 Paul 16 15 190

Exercise: Note that the Names that weren’t present in both dataframes dropped out of the final result. If you’d like to keep them and have NaNs appear, you can change the how parameter in the pd.merge() function. Let’s try out a few options by merging dataframes 1 and 2. Use how="outer".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "outer")

Name Age Weight
0 Arash 19.0 32.0
1 Jenny 17.0 NaN
2 Paul 16.0 15.0
3 Sara NaN 37.0

Exercise: Use how="left".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "left")

Name Age Weight
0 Paul 16 15.0
1 Arash 19 32.0
2 Jenny 17 NaN

Exercise: Use how="right".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "right")

Name Age Weight
0 Arash 19.0 32
1 Paul 16.0 15
2 Sara NaN 37

Exercise: Use how="inner".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "inner")

Name Age Weight
0 Paul 16 15
1 Arash 19 32

Recognizing that multiple inner joins can result in high data attrition, what policies would you put in your future data analyses to both prevent data loss and keep data easy to analyze?

Section 3: Melts

The pd.melt() function and DataFrame.melt() method take a single dataframe and make it taller by taking data stored in column names and putting it into the rows along with the rest of the data, adding extra metadata in the process.

For example, it can turn this df DataFrame:

Month Year Monday Tuesday Wednesday
January 2021 0 -2 -1
February 2021 2 4 -2

into this:

Month Year Weekday Temperature
January 2021 Monday 0
January 2021 Tuesday -2
January 2021 Wednesday -1
February 2021 Monday 2
February 2021 Tuesday 4
February 2021 Wednesday -2

with one line of code:

pd.melt(
    df, 
    id_vars=['Month', 'Year'],  # The columns that should stay the same
    value_vars=['Monday', 'Tuesday', 'Wednesday'],   # The columns that should melt
    var_name='Weekday',  # The new Column that will represent the melted column name's variable
    value_name='Temperature'  # The new Column that the data represents
)

Note: Melting a dataframe is also called “tidying” data, making a “long” dataframe from a “wide” dataframe, or building a design matrix.

Code Description
pd.melt(df, id_vars=['col1'], value_vars=['col2', 'col3']) Melt a DataFrame by specifying which columns to keep (id_vars) and which to melt (value_vars).
pd.melt(df, id_vars=['col1'], var_name='variable', value_name='value') Melt a DataFrame and specify custom names for the resulting variable and value columns.
df.reset_index(drop=True) Reset the DataFrame index to 0, 1, 2, … and drop the old index.

Let’s practice tidying dataframes with the pd.melt() function. Run the cell below to create a dataframe df containing data from the world bank dataset.

Exercises

df = (
    pd.read_csv('https://raw.githubusercontent.com/nickdelgrosso/CodeTeachingMaterials/main/datasets/worldbankdata.csv')
    .get(['Country Name', 'Country Code', '1960', '1970', '1980', '1990', '2000'])
    .sample(10)
    .reset_index(drop=True)
)
df.head()

Country Name Country Code 1960 1970 1980 1990 2000
0 Rwanda RWA 8.187 8.231 8.448 7.271 5.897
1 Other small states OSS 5.841 5.710 5.487 5.012 4.236
2 Pakistan PAK 6.600 6.601 6.535 6.024 4.474
3 Russian Federation RUS 2.520 1.990 1.890 1.890 1.210
4 Palau PLW NaN NaN NaN 2.757 1.535

Exercise: Melt this dataset so it has four columns: “Country Name”, “Country Code”, “Year”, and “Fertility Rate”

Solution
df2 = pd.melt(df,
        id_vars = ['Country Name', 'Country Code'],
        value_vars = ['1960', '1970', '1980', '1990', '2000'],
        var_name='Decade',
        value_name='Birth Rate',
        )
df2

Country Name Country Code Decade Birth Rate
0 Rwanda RWA 1960 8.187
1 Other small states OSS 1960 5.841
2 Pakistan PAK 1960 6.600
3 Russian Federation RUS 1960 2.520
4 Palau PLW 1960 NaN
5 Mozambique MOZ 1960 6.600
6 New Caledonia NCL 1960 6.278
7 American Samoa ASM 1960 NaN
8 Czech Republic CZE 1960 2.075
9 Liechtenstein LIE 1960 NaN
10 Rwanda RWA 1970 8.231
11 Other small states OSS 1970 5.710
12 Pakistan PAK 1970 6.601
13 Russian Federation RUS 1970 1.990
14 Palau PLW 1970 NaN
15 Mozambique MOZ 1970 6.591
16 New Caledonia NCL 1970 4.300
17 American Samoa ASM 1970 NaN
18 Czech Republic CZE 1970 1.929
19 Liechtenstein LIE 1970 NaN
20 Rwanda RWA 1980 8.448
21 Other small states OSS 1980 5.487
22 Pakistan PAK 1980 6.535
23 Russian Federation RUS 1980 1.890
24 Palau PLW 1980 NaN
25 Mozambique MOZ 1980 6.490
26 New Caledonia NCL 1980 3.256
27 American Samoa ASM 1980 NaN
28 Czech Republic CZE 1980 2.070
29 Liechtenstein LIE 1980 NaN
30 Rwanda RWA 1990 7.271
31 Other small states OSS 1990 5.012
32 Pakistan PAK 1990 6.024
33 Russian Federation RUS 1990 1.890
34 Palau PLW 1990 2.757
35 Mozambique MOZ 1990 6.237
36 New Caledonia NCL 1990 3.190
37 American Samoa ASM 1990 NaN
38 Czech Republic CZE 1990 1.893
39 Liechtenstein LIE 1990 NaN
40 Rwanda RWA 2000 5.897
41 Other small states OSS 2000 4.236
42 Pakistan PAK 2000 4.474
43 Russian Federation RUS 2000 1.210
44 Palau PLW 2000 1.535
45 Mozambique MOZ 2000 5.782
46 New Caledonia NCL 2000 2.630
47 American Samoa ASM 2000 NaN
48 Czech Republic CZE 2000 1.140
49 Liechtenstein LIE 2000 1.570