Skip to content

Binder

Processing Dirty Data

Background

This is fake data generated to demonstrate the capabilities of pyjanitor. It contains a bunch of common problems that we regularly encounter when working with data. Let's go fix it!

Load Packages

Importing pyjanitor is all that's needed to give Pandas Dataframes extra methods to work with your data.

import janitor
import matplotlib.pyplot as plt
import missingno as mn
import pandas as pd
from pyprojroot import here

%matplotlib inline

Load Data

df = pd.read_excel(here() / "data/dirty_data.xlsx", engine="openpyxl")
df
First Name Last Name Employee Status Subject Hire Date % Allocated Full time? do not edit! ---> Certification Certification.1 Certification.2
0 Jason Bourne Teacher PE 39690.0 0.75 Yes NaN Physical ed Theater NaN
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes NaN Physical ed Theater NaN
2 Alicia Keys Teacher Music 37118.0 1.00 Yes NaN Instr. music Vocal music NaN
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes NaN PENDING Computers NaN
4 Desus Nice Administration Dean 41431.0 1.00 Yes NaN PENDING NaN NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes NaN Science 6-12 Physics NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 James Joyce Teacher English 32994.0 0.50 No NaN NaN English 6-12 NaN
9 Hedy Lamarr Teacher Science 27919.0 0.50 No NaN PENDING NaN NaN
10 Carlos Boozer Coach Basketball 42221.0 NaN No NaN Physical ed NaN NaN
11 Young Boozer Coach NaN 34700.0 NaN No NaN NaN Political sci. NaN
12 Micheal Larsen Teacher English 40071.0 0.80 No NaN Vocal music English NaN

Visualizing the NaN Values

Quickly visualizing the NaN values with the package missingno can help us see how dirty the data may be and if the NaN values have any relationships between columns. And as we will see later, it can also be used to visualize the improvement afforded by pyjanitor.

Here is the current visual state of the data (the right side of the chart gives a vertical line graph of min/max row information):

mn.matrix(df);

Cleaning Column Names

There are a bunch of problems with this data. Firstly, the column names are not lowercase, and they have spaces. This will make it cumbersome to use in a programmatic function. To solve this, we can use the clean_names() method.

df_clean = df.clean_names()
df_clean.head(2)
first_name last_name employee_status subject hire_date %_allocated full_time_ do_not_edit!_> certification certification_1 certification_2
0 Jason Bourne Teacher PE 39690.0 0.75 Yes NaN Physical ed Theater NaN
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes NaN Physical ed Theater NaN

Notice now how the column names have been made better.

If you squint at the unclean dataset, you'll notice one row and one column of data that are missing. We can also fix this! Building on top of the code block from above, let's now remove those empty columns using the remove_empty() method:

df_clean = df.clean_names().remove_empty()
df_clean.head(9).tail(4)
first_name last_name employee_status subject hire_date %_allocated full_time_ certification certification_1
5 Chien-Shiung Wu Teacher Physics 11037.0 0.5 Yes Science 6-12 Physics
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.5 Yes Science 6-12 Physics
7 James Joyce Teacher English 32994.0 0.5 No NaN English 6-12
8 Hedy Lamarr Teacher Science 27919.0 0.5 No PENDING NaN

Now this is starting to shape up well!

Renaming Individual Columns

Next, let's rename some of the columns. %_allocated and full_time? contain non-alphanumeric characters, so they make it a bit harder to use. We can rename them using the :py:meth:rename_column() method:

df_clean = (
    df.clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
)

df_clean.head(5)
first_name last_name employee_status subject hire_date percent_allocated full_time certification certification_1
0 Jason Bourne Teacher PE 39690.0 0.75 Yes Physical ed Theater
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes Physical ed Theater
2 Alicia Keys Teacher Music 37118.0 1.00 Yes Instr. music Vocal music
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes PENDING Computers
4 Desus Nice Administration Dean 41431.0 1.00 Yes PENDING NaN

Note how now we have really nice column names! You might be wondering why I'm not modifying the two certifiation columns -- that is the next thing we'll tackle.

Coalescing Columns

If we look more closely at the two certification columns, we'll see that they look like this:

df_clean[["certification", "certification_1"]]
certification certification_1
0 Physical ed Theater
1 Physical ed Theater
2 Instr. music Vocal music
3 PENDING Computers
4 PENDING NaN
5 Science 6-12 Physics
6 Science 6-12 Physics
7 NaN English 6-12
8 PENDING NaN
9 Physical ed NaN
10 NaN Political sci.
11 Vocal music English

Rows 8 and 11 have NaN in the left certification column, but have a value in the right certification column. Let's assume for a moment that the left certification column is intended to record the first certification that a teacher had obtained. In this case, the values in the right certification column on rows 8 and 11 should be moved to the first column. Let's do that with Janitor, using the coalesce() method, which does the following:

df_clean = (
    df.clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce("certification", "certification_1", new_column_name="certification")
)

df_clean
first_name last_name employee_status subject hire_date percent_allocated full_time certification certification_1
0 Jason Bourne Teacher PE 39690.0 0.75 Yes Physical ed Theater
1 Jason Bourne Teacher Drafting 39690.0 0.25 Yes Physical ed Theater
2 Alicia Keys Teacher Music 37118.0 1.00 Yes Instr. music Vocal music
3 Ada Lovelace Teacher NaN 27515.0 1.00 Yes PENDING Computers
4 Desus Nice Administration Dean 41431.0 1.00 Yes PENDING NaN
5 Chien-Shiung Wu Teacher Physics 11037.0 0.50 Yes Science 6-12 Physics
6 Chien-Shiung Wu Teacher Chemistry 11037.0 0.50 Yes Science 6-12 Physics
7 James Joyce Teacher English 32994.0 0.50 No English 6-12 English 6-12
8 Hedy Lamarr Teacher Science 27919.0 0.50 No PENDING NaN
9 Carlos Boozer Coach Basketball 42221.0 NaN No Physical ed NaN
10 Young Boozer Coach NaN 34700.0 NaN No Political sci. Political sci.
11 Micheal Larsen Teacher English 40071.0 0.80 No Vocal music English

Awesome stuff! Now we don't have two columns of scattered data, we have one column of densely populated data.`

Dealing with Excel Dates

Finally, notice how the hire_date column isn't date formatted. It's got this weird Excel serialization. To clean up this data, we can use the :py:meth:convert_excel_date method.

df_clean = (
    df.clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce("certification", "certification_1", target_column_name="certification")
    .convert_excel_date("hire_date")
)
df_clean
first_name last_name employee_status subject hire_date percent_allocated full_time certification certification_1
0 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed Theater
1 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed Theater
2 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music Vocal music
3 Ada Lovelace Teacher NaN 1975-05-01 1.00 Yes PENDING Computers
4 Desus Nice Administration Dean 2013-06-06 1.00 Yes PENDING NaN
5 Chien-Shiung Wu Teacher Physics 1930-03-20 0.50 Yes Science 6-12 Physics
6 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.50 Yes Science 6-12 Physics
7 James Joyce Teacher English 1990-05-01 0.50 No English 6-12 English 6-12
8 Hedy Lamarr Teacher Science 1976-06-08 0.50 No PENDING NaN
9 Carlos Boozer Coach Basketball 2015-08-05 NaN No Physical ed NaN
10 Young Boozer Coach NaN 1995-01-01 NaN No Political sci. Political sci.
11 Micheal Larsen Teacher English 2009-09-15 0.80 No Vocal music English

Comparing the DataFrame Before and After pyjanitor

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 15))

mn.matrix(df, ax=ax1, sparkline=False)
mn.matrix(df_clean, ax=ax2, sparkline=False)

plt.tight_layout();

We have a cleaned dataframe!