Processing Dirty Data

Background

This is fake data generated to demonstrate the capabilities of pyjanitor. It contains a bunch of common problems that we regularly encounter when working with data. Let's go fix it!

Load Packages

Importing pyjanitor is all that's needed to give Pandas Dataframes extra methods to work with your data.

import janitor
import matplotlib.pyplot as plt
import missingno as mn
import pandas as pd
from pyprojroot import here

%matplotlib inline

Load Data

df = pd.read_excel(here() / "data/dirty_data.xlsx", engine="openpyxl")
df

	First Name	Last Name	Employee Status	Subject	Hire Date	% Allocated	Full time?	do not edit! --->	Certification	Certification.1	Certification.2
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	NaN	Physical ed	Theater	NaN
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	NaN	Physical ed	Theater	NaN
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	NaN	Instr. music	Vocal music	NaN
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	NaN	PENDING	Computers	NaN
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	NaN	PENDING	NaN	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	NaN	Science 6-12	Physics	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	James	Joyce	Teacher	English	32994.0	0.50	No	NaN	NaN	English 6-12	NaN
9	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	NaN	PENDING	NaN	NaN
10	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	NaN	Physical ed	NaN	NaN
11	Young	Boozer	Coach	NaN	34700.0	NaN	No	NaN	NaN	Political sci.	NaN
12	Micheal	Larsen	Teacher	English	40071.0	0.80	No	NaN	Vocal music	English	NaN

Visualizing the NaN Values

Quickly visualizing the NaN values with the package missingno can help us see how dirty the data may be and if the NaN values have any relationships between columns. And as we will see later, it can also be used to visualize the improvement afforded by pyjanitor.

Here is the current visual state of the data (the right side of the chart gives a vertical line graph of min/max row information):

mn.matrix(df);

Cleaning Column Names

There are a bunch of problems with this data. Firstly, the column names are not lowercase, and they have spaces. This will make it cumbersome to use in a programmatic function. To solve this, we can use the clean_names() method.

df_clean = df.clean_names()
df_clean.head(2)

	first_name	last_name	employee_status	subject	hire_date	%_allocated	full_time_	do_not_edit!_>	certification	certification_1	certification_2
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	NaN	Physical ed	Theater	NaN
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	NaN	Physical ed	Theater	NaN

Notice now how the column names have been made better.

If you squint at the unclean dataset, you'll notice one row and one column of data that are missing. We can also fix this! Building on top of the code block from above, let's now remove those empty columns using the remove_empty() method:

df_clean = df.clean_names().remove_empty()
df_clean.head(9).tail(4)

	first_name	last_name	employee_status	subject	hire_date	%_allocated	full_time_	certification	certification_1
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.5	Yes	Science 6-12	Physics
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.5	Yes	Science 6-12	Physics
7	James	Joyce	Teacher	English	32994.0	0.5	No	NaN	English 6-12
8	Hedy	Lamarr	Teacher	Science	27919.0	0.5	No	PENDING	NaN

Now this is starting to shape up well!

Renaming Individual Columns

Next, let's rename some of the columns. %_allocated and full_time? contain non-alphanumeric characters, so they make it a bit harder to use. We can rename them using the :py:meth:rename_column() method:

df_clean = (
    df.clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
)

df_clean.head(5)

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification	certification_1
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	Physical ed	Theater
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	Physical ed	Theater
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	Instr. music	Vocal music
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	PENDING	Computers
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	PENDING	NaN

Note how now we have really nice column names! You might be wondering why I'm not modifying the two certifiation columns -- that is the next thing we'll tackle.

Coalescing Columns

If we look more closely at the two certification columns, we'll see that they look like this:

df_clean[["certification", "certification_1"]]

	certification	certification_1
0	Physical ed	Theater
1	Physical ed	Theater
2	Instr. music	Vocal music
3	PENDING	Computers
4	PENDING	NaN
5	Science 6-12	Physics
6	Science 6-12	Physics
7	NaN	English 6-12
8	PENDING	NaN
9	Physical ed	NaN
10	NaN	Political sci.
11	Vocal music	English

Rows 8 and 11 have NaN in the left certification column, but have a value in the right certification column. Let's assume for a moment that the left certification column is intended to record the first certification that a teacher had obtained. In this case, the values in the right certification column on rows 8 and 11 should be moved to the first column. Let's do that with Janitor, using the coalesce() method, which does the following:

df_clean = (
    df.clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce("certification", "certification_1", new_column_name="certification")
)

df_clean

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification	certification_1
0	Jason	Bourne	Teacher	PE	39690.0	0.75	Yes	Physical ed	Theater
1	Jason	Bourne	Teacher	Drafting	39690.0	0.25	Yes	Physical ed	Theater
2	Alicia	Keys	Teacher	Music	37118.0	1.00	Yes	Instr. music	Vocal music
3	Ada	Lovelace	Teacher	NaN	27515.0	1.00	Yes	PENDING	Computers
4	Desus	Nice	Administration	Dean	41431.0	1.00	Yes	PENDING	NaN
5	Chien-Shiung	Wu	Teacher	Physics	11037.0	0.50	Yes	Science 6-12	Physics
6	Chien-Shiung	Wu	Teacher	Chemistry	11037.0	0.50	Yes	Science 6-12	Physics
7	James	Joyce	Teacher	English	32994.0	0.50	No	English 6-12	English 6-12
8	Hedy	Lamarr	Teacher	Science	27919.0	0.50	No	PENDING	NaN
9	Carlos	Boozer	Coach	Basketball	42221.0	NaN	No	Physical ed	NaN
10	Young	Boozer	Coach	NaN	34700.0	NaN	No	Political sci.	Political sci.
11	Micheal	Larsen	Teacher	English	40071.0	0.80	No	Vocal music	English

Awesome stuff! Now we don't have two columns of scattered data, we have one column of densely populated data.`

Dealing with Excel Dates

Finally, notice how the hire_date column isn't date formatted. It's got this weird Excel serialization. To clean up this data, we can use the :py:meth:convert_excel_date method.

df_clean = (
    df.clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce("certification", "certification_1", target_column_name="certification")
    .convert_excel_date("hire_date")
)
df_clean

	first_name	last_name	employee_status	subject	hire_date	percent_allocated	full_time	certification	certification_1
0	Jason	Bourne	Teacher	PE	2008-08-30	0.75	Yes	Physical ed	Theater
1	Jason	Bourne	Teacher	Drafting	2008-08-30	0.25	Yes	Physical ed	Theater
2	Alicia	Keys	Teacher	Music	2001-08-15	1.00	Yes	Instr. music	Vocal music
3	Ada	Lovelace	Teacher	NaN	1975-05-01	1.00	Yes	PENDING	Computers
4	Desus	Nice	Administration	Dean	2013-06-06	1.00	Yes	PENDING	NaN
5	Chien-Shiung	Wu	Teacher	Physics	1930-03-20	0.50	Yes	Science 6-12	Physics
6	Chien-Shiung	Wu	Teacher	Chemistry	1930-03-20	0.50	Yes	Science 6-12	Physics
7	James	Joyce	Teacher	English	1990-05-01	0.50	No	English 6-12	English 6-12
8	Hedy	Lamarr	Teacher	Science	1976-06-08	0.50	No	PENDING	NaN
9	Carlos	Boozer	Coach	Basketball	2015-08-05	NaN	No	Physical ed	NaN
10	Young	Boozer	Coach	NaN	1995-01-01	NaN	No	Political sci.	Political sci.
11	Micheal	Larsen	Teacher	English	2009-09-15	0.80	No	Vocal music	English

Comparing the DataFrame Before and After `pyjanitor`

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 15))

mn.matrix(df, ax=ax1, sparkline=False)
mn.matrix(df_clean, ax=ax2, sparkline=False)

plt.tight_layout();

We have a cleaned dataframe!