iBOTS Learn: Reorganizing Data in DataFrames

Course Catalog

Intro to Python for Scientists

Tabular Data

Reorganizing Data in DataFrames

Authors

Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Open in JupyterLite Download Materials

Setup

Import Libraries

import numpy as np
import pandas as pd

Download dataset

import owncloud

owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/TfFLatD2ApJEx5M', folder_password='ibots').get_file('/', 'worldbankdata.csv')

True

Section 1: Concatenation / Appends

The pd.concat() function and DataFrame.append() method take DataFrames with identical columns and make a DataFrame that is taller than either of them by stacking them on top of each other.

For example, it can turn this df1 DataFrame:

Day	Weather
Monday	Sunny
Tuesday	Rainy

and this df2 DataFrame:

Day	Weather
Wednesday	Sunny
Thursday	Rainy

into this:

Day	Weather
Monday	Sunny
Tuesday	Rainy
Wednesday	Sunny
Thursday	Rainy

with one line of code:

pd.concat([df1, df2])

Note: If you’d like pandas to ignore the index of the dataframes when appending them, the following option is helpful:

pd.concat([df1, df2], ignore_index=True)

Note: With the pd.concat() function, you can concatenate as many dataframes in one step as you want!

Code	Description
`pd.concat([df1, df2])`	Concatenate DataFrames vertically (stack them on top of each other).
`pd.concat([df1, df2], ignore_index=True)`	Concatenate DataFrames and reset the index to 0, 1, 2, …
`pd.concat([df1, df2, df3])`	Concatenate multiple DataFrames in a single operation.

Run the cells below to create three DataFrames that we’ll use to practice concatenating DataFrames with the pd.concat function.

Exercises

df1 = pd.DataFrame({'Name': ['Jackson', 'Josh', 'Jenny'], 'Age': [16, 19, 17]})
df1

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17

df2 = pd.DataFrame({'Name': ['Jess', 'Jon', 'Joe'], 'Age': [21, 16, 23]}, index=[3, 4, 5])
df2

	Name	Age
3	Jess	21
4	Jon	16
5	Joe	23

df3 = pd.DataFrame({'Name': ['Jill', 'Josephine', 'Jack'], 'Age': [5, 10, 61]})
df3

	Name	Age
0	Jill	5
1	Josephine	10
2	Jack	61

Exercise: Concatenate df1 and df2 together!

Solution

pd.concat([df1, df2])

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17
3	Jess	21
4	Jon	16
5	Joe	23

Exercise: Concatenate df2 and df3 together!

Solution

pd.concat([df2, df3])

	Name	Age
3	Jess	21
4	Jon	16
5	Joe	23
0	Jill	5
1	Josephine	10
2	Jack	61

Exercise: Concatenate all three dataframes in a single line.

Solution

pd.concat([df1, df2, df3])

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17
3	Jess	21
4	Jon	16
5	Joe	23
0	Jill	5
1	Josephine	10
2	Jack	61

Exercise: The index in these datasets is unlabelled, indicating that they potentially don’t contain useful data. Concatenate them all together so that the index of the final dataframe is simply 0-8.

Solution

pd.concat([df1, df2, df3], ignore_index=True)

	Name	Age
0	Jackson	16
1	Josh	19
2	Jenny	17
3	Jess	21
4	Jon	16
5	Joe	23
6	Jill	5
7	Josephine	10
8	Jack	61

Section 2: Merge / Joins

The pd.merge() function and DataFrame.join() method take two DataFrames and make them wider by matching rows with the same values on a specified column.

For example, it can turn this df1 DataFrame:

Day	Weather
Monday	Sunny
Tuesday	Rainy

and this df2 DataFrame:

Day	Temperature
Tuesday	12
Monday	18

into this:

Day	Weather	Temperature
Monday	Sunny	18
Tuesday	Rainy	12

with one line of code:

df_merged = pd.merge(left=df1, right=df2, left_on="Day", right_on="Day")

Just specify which columns should be matched up with each other, and it will search for the matching values automatically! If you want it to use the index, you can alternatively supply the option left_index=True and/or right_index=True.

Code	Description
`pd.merge(left=df1, right=df2, left_on='col1', right_on='col2')`	Merge two DataFrames based on matching column values.
`pd.merge(left=df1, right=df2, left_index=True, right_index=True)`	Merge two DataFrames using their indices.
`pd.merge(left=df1, right=df2, on='col', how='inner')`	Inner join: keep only rows that match in both DataFrames.
`pd.merge(left=df1, right=df2, on='col', how='outer')`	Outer join: keep all rows from both DataFrames, fill with NaN where no match.
`pd.merge(left=df1, right=df2, on='col', how='left')`	Left join: keep all rows from left DataFrame, fill with NaN where no match in right.
`pd.merge(left=df1, right=df2, on='col', how='right')`	Right join: keep all rows from right DataFrame, fill with NaN where no match in left.

Run the cells below to create three dataframes that we’ll use to practice merging with pd.merge().

Exercises

df1 = pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]})
df1

	Name	Age
0	Paul	16
1	Arash	19
2	Jenny	17

Dataframe 2:

df2 = pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]})
df2

	Name	Weight
0	Arash	32
1	Paul	15
2	Sara	37

Dataframe 3:

df3 = pd.DataFrame({'Name': ['Amy', 'Paul', 'Sara'], 'Height': [170, 190, 143]})
df3

	Name	Height
0	Amy	170
1	Paul	190
2	Sara	143

Exercise: Merge the first two dataframes together. Who do we know both the age and weight of?

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')

	Name	Age	Weight
0	Paul	16	15
1	Arash	19	32

Exercise: Try merging all 3 by merging twice. Who do we know everything about?

Solution

df_merge1 = pd.merge(left = df1, right = df2, left_on = 'Name', right_on='Name')
df = pd.merge(left=df_merge1, right = df3, left_on = 'Name', right_on = 'Name')
df

	Name	Age	Weight	Height
0	Paul	16	15	190

Exercise: Note that the Names that weren’t present in both dataframes dropped out of the final result. If you’d like to keep them and have NaNs appear, you can change the how parameter in the pd.merge() function. Let’s try out a few options by merging dataframes 1 and 2. Use how="outer".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "outer")

	Name	Age	Weight
0	Arash	19.0	32.0
1	Jenny	17.0	NaN
2	Paul	16.0	15.0
3	Sara	NaN	37.0

Exercise: Use how="left".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "left")

	Name	Age	Weight
0	Paul	16	15.0
1	Arash	19	32.0
2	Jenny	17	NaN

Exercise: Use how="right".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "right")

	Name	Age	Weight
0	Arash	19.0	32
1	Paul	16.0	15
2	Sara	NaN	37

Exercise: Use how="inner".

Solution

pd.merge(left = df1, right = df2, left_on = 'Name', right_on = 'Name', how = "inner")

	Name	Age	Weight
0	Paul	16	15
1	Arash	19	32

Recognizing that multiple inner joins can result in high data attrition, what policies would you put in your future data analyses to both prevent data loss and keep data easy to analyze?

Section 3: Melts

The pd.melt() function and DataFrame.melt() method take a single dataframe and make it taller by taking data stored in column names and putting it into the rows along with the rest of the data, adding extra metadata in the process.

For example, it can turn this df DataFrame:

Month	Year	Monday	Tuesday	Wednesday
January	2021	0	-2	-1
February	2021	2	4	-2

into this:

Month	Year	Weekday	Temperature
January	2021	Monday	0
January	2021	Tuesday	-2
January	2021	Wednesday	-1
February	2021	Monday	2
February	2021	Tuesday	4
February	2021	Wednesday	-2

with one line of code:

pd.melt(
    df, 
    id_vars=['Month', 'Year'],  # The columns that should stay the same
    value_vars=['Monday', 'Tuesday', 'Wednesday'],   # The columns that should melt
    var_name='Weekday',  # The new Column that will represent the melted column name's variable
    value_name='Temperature'  # The new Column that the data represents
)

Note: Melting a dataframe is also called “tidying” data, making a “long” dataframe from a “wide” dataframe, or building a design matrix.

Code	Description
`pd.melt(df, id_vars=['col1'], value_vars=['col2', 'col3'])`	Melt a DataFrame by specifying which columns to keep (`id_vars`) and which to melt (`value_vars`).
`pd.melt(df, id_vars=['col1'], var_name='variable', value_name='value')`	Melt a DataFrame and specify custom names for the resulting variable and value columns.
`df.reset_index(drop=True)`	Reset the DataFrame index to 0, 1, 2, … and drop the old index.

Let’s practice tidying dataframes with the pd.melt() function. Run the cell below to create a dataframe df containing data from the world bank dataset.

Exercises

df = (
    pd.read_csv('worldbankdata.csv')
    .get(['Country Name', 'Country Code', '1960', '1970', '1980', '1990', '2000'])
    .sample(10, random_state=42)
    .reset_index(drop=True)
)
df.head()

	Country Name	Country Code	1960	1970	1980	1990	2000
0	Peru	PER	6.876	6.307	5.009	3.828	2.929
1	Israel	ISR	3.866	3.778	3.242	2.824	2.950
2	Congo, Dem. Rep.	COD	6.001	6.214	6.585	7.133	7.089
3	Zambia	ZMB	7.018	7.437	7.182	6.468	6.071
4	Benin	BEN	6.282	6.748	7.025	6.744	5.975

Exercise: Melt this dataset so it has four columns: “Country Name”, “Country Code”, “Year”, and “Fertility Rate”

Solution

df2 = pd.melt(df,
        id_vars = ['Country Name', 'Country Code'],
        value_vars = ['1960', '1970', '1980', '1990', '2000'],
        var_name='Decade',
        value_name='Birth Rate',
        )

df2

	Country Name	Country Code	Decade	Birth Rate
0	Peru	PER	1960	6.876
1	Israel	ISR	1960	3.866
2	Congo, Dem. Rep.	COD	1960	6.001
3	Zambia	ZMB	1960	7.018
4	Benin	BEN	1960	6.282
5	Korea, Rep.	KOR	1960	6.155
6	Senegal	SEN	1960	6.951
7	Venezuela, RB	VEN	1960	6.616
8	Greece	GRC	1960	2.277
9	New Caledonia	NCL	1960	6.278
10	Peru	PER	1970	6.307
11	Israel	ISR	1970	3.778
12	Congo, Dem. Rep.	COD	1970	6.214
13	Zambia	ZMB	1970	7.437
14	Benin	BEN	1970	6.748
15	Korea, Rep.	KOR	1970	4.530
16	Senegal	SEN	1970	7.339
17	Venezuela, RB	VEN	1970	5.404
18	Greece	GRC	1970	2.337
19	New Caledonia	NCL	1970	4.300
20	Peru	PER	1980	5.009
21	Israel	ISR	1980	3.242
22	Congo, Dem. Rep.	COD	1980	6.585
23	Zambia	ZMB	1980	7.182
24	Benin	BEN	1980	7.025
25	Korea, Rep.	KOR	1980	2.830
26	Senegal	SEN	1980	7.382
27	Venezuela, RB	VEN	1980	4.199
28	Greece	GRC	1980	2.226
29	New Caledonia	NCL	1980	3.256
30	Peru	PER	1990	3.828
31	Israel	ISR	1990	2.824
32	Congo, Dem. Rep.	COD	1990	7.133
33	Zambia	ZMB	1990	6.468
34	Benin	BEN	1990	6.744
35	Korea, Rep.	KOR	1990	1.590
36	Senegal	SEN	1990	6.629
37	Venezuela, RB	VEN	1990	3.448
38	Greece	GRC	1990	1.400
39	New Caledonia	NCL	1990	3.190
40	Peru	PER	2000	2.929
41	Israel	ISR	2000	2.950
42	Congo, Dem. Rep.	COD	2000	7.089
43	Zambia	ZMB	2000	6.071
44	Benin	BEN	2000	5.975
45	Korea, Rep.	KOR	2000	1.470
46	Senegal	SEN	2000	5.561
47	Venezuela, RB	VEN	2000	2.822
48	Greece	GRC	2000	1.260
49	New Caledonia	NCL	2000	2.630