Reorganizing Data in DataFrames

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Import Libraries

import numpy as np
import pandas as pd

Download dataset

import owncloud

owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/TfFLatD2ApJEx5M', folder_password='ibots').get_file('/', 'worldbankdata.csv')
True

Section 1: Concatenation / Appends

The pd.concat() function and DataFrame.append() method take DataFrames with identical columns and make a DataFrame that is taller than either of them by stacking them on top of each other.

For example, it can turn this df1 DataFrame:

Day Weather
Monday Sunny
Tuesday Rainy

and this df2 DataFrame:

Day Weather
Wednesday Sunny
Thursday Rainy

into this:

Day Weather
Monday Sunny
Tuesday Rainy
Wednesday Sunny
Thursday Rainy

with one line of code:

pd.concat([df1, df2])

Note: If you’d like pandas to ignore the index of the dataframes when appending them, the following option is helpful:

pd.concat([df1, df2], ignore_index=True)

Note: With the pd.concat() function, you can concatenate as many dataframes in one step as you want!

Code Description
pd.concat([df1, df2]) Concatenate DataFrames vertically (stack them on top of each other).
pd.concat([df1, df2], ignore_index=True) Concatenate DataFrames and reset the index to 0, 1, 2, …
pd.concat([df1, df2, df3]) Concatenate multiple DataFrames in a single operation.

Run the cells below to create three DataFrames that we’ll use to practice concatenating DataFrames with the pd.concat function.

Exercises

df1 = pd.DataFrame({'Name': ['Jackson', 'Josh', 'Jenny'], 'Age': [16, 19, 17]})
df1
Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
df2 = pd.DataFrame({'Name': ['Jess', 'Jon', 'Joe'], 'Age': [21, 16, 23]}, index=[3, 4, 5])
df2
Name Age
3 Jess 21
4 Jon 16
5 Joe 23
df3 = pd.DataFrame({'Name': ['Jill', 'Josephine', 'Jack'], 'Age': [5, 10, 61]})
df3
Name Age
0 Jill 5
1 Josephine 10
2 Jack 61

Exercise: Concatenate df1 and df2 together!

Solution
pd.concat([df1, df2])
Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
3 Jess 21
4 Jon 16
5 Joe 23

Exercise: Concatenate df2 and df3 together!

Solution
pd.concat([df2, df3])
Name Age
3 Jess 21
4 Jon 16
5 Joe 23
0 Jill 5
1 Josephine 10
2 Jack 61

Exercise: Concatenate all three dataframes in a single line.

Solution
pd.concat([df1, df2, df3])
Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
3 Jess 21
4 Jon 16
5 Joe 23
0 Jill 5
1 Josephine 10
2 Jack 61

Exercise: The index in these datasets is unlabelled, indicating that they potentially don’t contain useful data. Concatenate them all together so that the index of the final dataframe is simply 0-8.

Solution
pd.concat([df1, df2, df3], ignore_index=True)
Name Age
0 Jackson 16
1 Josh 19
2 Jenny 17
3 Jess 21
4 Jon 16
5 Joe 23
6 Jill 5
7 Josephine 10
8 Jack 61

Section 2: Merge / Joins

The pd.merge() function and DataFrame.join() method take two DataFrames and make them wider by matching rows with the same values on a specified column.

For example, it can turn this df1 DataFrame:

Day Weather
Monday Sunny
Tuesday Rainy

and this df2 DataFrame:

Day Temperature
Tuesday 12
Monday 18

into this:

Day Weather Temperature
Monday Sunny 18
Tuesday Rainy 12

with one line of code:

df_merged = pd.merge(left=df1, right=df2, left_on="Day", right_on="Day")

Just specify which columns should be matched up with each other, and it will search for the matching values automatically! If you want it to use the index, you can alternatively supply the option left_index=True and/or right_index=True.

Code Description
pd.merge(left=df1, right=df2, left_on='col1', right_on='col2') Merge two DataFrames based on matching column values.
pd.merge(left=df1, right=df2, left_index=True, right_index=True) Merge two DataFrames using their indices.
pd.merge(left=df1, right=df2, on='col', how='inner') Inner join: keep only rows that match in both DataFrames.
pd.merge(left=df1, right=df2, on='col', how='outer') Outer join: keep all rows from both DataFrames, fill with NaN where no match.
pd.merge(left=df1, right=df2, on='col', how='left') Left join: keep all rows from left DataFrame, fill with NaN where no match in right.
pd.merge(left=df1, right=df2, on='col', how='right') Right join: keep all rows from right DataFrame, fill with NaN where no match in left.

Run the cells below to create three dataframes that we’ll use to practice merging with pd.merge().

Exercises

df1 = pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]})
df1
Name Age
0 Paul 16
1 Arash 19
2 Jenny 17

Dataframe 2:

df2 = pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]})
df2
Name Weight
0 Arash 32
1 Paul 15
2 Sara 37

Dataframe 3:

df3 = pd.DataFrame({'Name': ['Amy', 'Paul', 'Sara'], 'Height': [170, 190, 143]})
df3
Name Height
0 Amy 170
1 Paul 190
2 Sara 143

Exercise: Merge the first two dataframes together. Who do we know both the age and weight of?

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')
Name Age Weight
0 Paul 16 15
1 Arash 19 32

Exercise: Try merging all 3 by merging twice. Who do we know everything about?

Solution
df_merge1 = pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')
df = pd.merge(left=df_merge1, right = df3, left_on = 'Name', right_on = 'Name')
df
Name Age Weight Height
0 Paul 16 15 190

Exercise: Note that the Names that weren’t present in both dataframes dropped out of the final result. If you’d like to keep them and have NaNs appear, you can change the how parameter in the pd.merge() function. Let’s try out a few options by merging dataframes 1 and 2. Use how="outer".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "outer")
Name Age Weight
0 Arash 19.0 32.0
1 Jenny 17.0 NaN
2 Paul 16.0 15.0
3 Sara NaN 37.0

Exercise: Use how="left".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "left")
Name Age Weight
0 Paul 16 15.0
1 Arash 19 32.0
2 Jenny 17 NaN

Exercise: Use how="right".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "right")
Name Age Weight
0 Arash 19.0 32
1 Paul 16.0 15
2 Sara NaN 37

Exercise: Use how="inner".

Solution
pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "inner")
Name Age Weight
0 Paul 16 15
1 Arash 19 32

Recognizing that multiple inner joins can result in high data attrition, what policies would you put in your future data analyses to both prevent data loss and keep data easy to analyze?

Section 3: Melts

The pd.melt() function and DataFrame.melt() method take a single dataframe and make it taller by taking data stored in column names and putting it into the rows along with the rest of the data, adding extra metadata in the process.

For example, it can turn this df DataFrame:

Month Year Monday Tuesday Wednesday
January 2021 0 -2 -1
February 2021 2 4 -2

into this:

Month Year Weekday Temperature
January 2021 Monday 0
January 2021 Tuesday -2
January 2021 Wednesday -1
February 2021 Monday 2
February 2021 Tuesday 4
February 2021 Wednesday -2

with one line of code:

pd.melt(
    df, 
    id_vars=['Month', 'Year'],  # The columns that should stay the same
    value_vars=['Monday', 'Tuesday', 'Wednesday'],   # The columns that should melt
    var_name='Weekday',  # The new Column that will represent the melted column name's variable
    value_name='Temperature'  # The new Column that the data represents
)

Note: Melting a dataframe is also called “tidying” data, making a “long” dataframe from a “wide” dataframe, or building a design matrix.

Code Description
pd.melt(df, id_vars=['col1'], value_vars=['col2', 'col3']) Melt a DataFrame by specifying which columns to keep (id_vars) and which to melt (value_vars).
pd.melt(df, id_vars=['col1'], var_name='variable', value_name='value') Melt a DataFrame and specify custom names for the resulting variable and value columns.
df.reset_index(drop=True) Reset the DataFrame index to 0, 1, 2, … and drop the old index.

Let’s practice tidying dataframes with the pd.melt() function. Run the cell below to create a dataframe df containing data from the world bank dataset.

Exercises

df = (
    pd.read_csv('worldbankdata.csv')
    .get(['Country Name', 'Country Code', '1960', '1970', '1980', '1990', '2000'])
    .sample(10, random_state=42)
    .reset_index(drop=True)
)
df.head()
Country Name Country Code 1960 1970 1980 1990 2000
0 Peru PER 6.876 6.307 5.009 3.828 2.929
1 Israel ISR 3.866 3.778 3.242 2.824 2.950
2 Congo, Dem. Rep. COD 6.001 6.214 6.585 7.133 7.089
3 Zambia ZMB 7.018 7.437 7.182 6.468 6.071
4 Benin BEN 6.282 6.748 7.025 6.744 5.975

Exercise: Melt this dataset so it has four columns: “Country Name”, “Country Code”, “Year”, and “Fertility Rate”

Solution
df2 = pd.melt(df,
        id_vars = ['Country Name', 'Country Code'],
        value_vars = ['1960', '1970', '1980', '1990', '2000'],
        var_name='Decade',
        value_name='Birth Rate',
        )
df2
Country Name Country Code Decade Birth Rate
0 Peru PER 1960 6.876
1 Israel ISR 1960 3.866
2 Congo, Dem. Rep. COD 1960 6.001
3 Zambia ZMB 1960 7.018
4 Benin BEN 1960 6.282
5 Korea, Rep. KOR 1960 6.155
6 Senegal SEN 1960 6.951
7 Venezuela, RB VEN 1960 6.616
8 Greece GRC 1960 2.277
9 New Caledonia NCL 1960 6.278
10 Peru PER 1970 6.307
11 Israel ISR 1970 3.778
12 Congo, Dem. Rep. COD 1970 6.214
13 Zambia ZMB 1970 7.437
14 Benin BEN 1970 6.748
15 Korea, Rep. KOR 1970 4.530
16 Senegal SEN 1970 7.339
17 Venezuela, RB VEN 1970 5.404
18 Greece GRC 1970 2.337
19 New Caledonia NCL 1970 4.300
20 Peru PER 1980 5.009
21 Israel ISR 1980 3.242
22 Congo, Dem. Rep. COD 1980 6.585
23 Zambia ZMB 1980 7.182
24 Benin BEN 1980 7.025
25 Korea, Rep. KOR 1980 2.830
26 Senegal SEN 1980 7.382
27 Venezuela, RB VEN 1980 4.199
28 Greece GRC 1980 2.226
29 New Caledonia NCL 1980 3.256
30 Peru PER 1990 3.828
31 Israel ISR 1990 2.824
32 Congo, Dem. Rep. COD 1990 7.133
33 Zambia ZMB 1990 6.468
34 Benin BEN 1990 6.744
35 Korea, Rep. KOR 1990 1.590
36 Senegal SEN 1990 6.629
37 Venezuela, RB VEN 1990 3.448
38 Greece GRC 1990 1.400
39 New Caledonia NCL 1990 3.190
40 Peru PER 2000 2.929
41 Israel ISR 2000 2.950
42 Congo, Dem. Rep. COD 2000 7.089
43 Zambia ZMB 2000 6.071
44 Benin BEN 2000 5.975
45 Korea, Rep. KOR 2000 1.470
46 Senegal SEN 2000 5.561
47 Venezuela, RB VEN 2000 2.822
48 Greece GRC 2000 1.260
49 New Caledonia NCL 2000 2.630