Processing Dirty Data
Background
This is fake data generated to demonstrate the capabilities of pyjanitor. It contains a bunch of common problems that we regularly encounter when working with data. Let's go fix it!
Load Packages
Importing pyjanitor is all that's needed to give pandas DataFrames extra methods for working with your data.
import janitor
import matplotlib.pyplot as plt
import missingno as mn
import pandas as pd
from pyprojroot import here
%matplotlib inline
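As a quick optional check, importing janitor should have registered pyjanitor's extra methods directly on pandas DataFrames:
# Importing janitor attaches pyjanitor's methods to pandas DataFrames,
# so clean_names (and friends) are now available on any DataFrame.
print(hasattr(pd.DataFrame, "clean_names"))  # expected: True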
Load Data
df = pd.read_excel(here() / "data/dirty_data.xlsx", engine="openpyxl")
df
Visualizing the NaN Values
Quickly visualizing the NaN values with the package missingno can help us see how dirty the data may be and whether the NaN values have any relationships between columns. As we will see later, it can also be used to visualize the improvement afforded by pyjanitor.
Here is the current visual state of the data (the sparkline on the right-hand side summarizes how many values are filled in each row, with the minimum and maximum annotated):
mn.matrix(df);
Cleaning Column Names
There are a bunch of problems with this data. Firstly, the column names are not lowercase and they contain spaces, which makes them cumbersome to use programmatically. To solve this, we can use the clean_names() method.
df_clean = df.clean_names()
df_clean.head(2)
Notice how the column names are now much cleaner.
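If you want to see exactly what changed, compare the column names before and after cleaning:
# Compare the raw and cleaned column names side by side.
print(list(df.columns))
print(list(df_clean.columns))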
If you squint at the unclean dataset, you'll notice one row and one column that are entirely empty. We can fix this too! Building on the code block above, let's remove those empty rows and columns using the remove_empty() method:
df_clean = df.clean_names().remove_empty()
df_clean.head(9).tail(4)
Now this is starting to shape up well!
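A quick way to confirm that the empty row and column were dropped is to compare the shapes of the two DataFrames:
# remove_empty() drops rows and columns that are entirely NaN,
# so the cleaned frame should be one row and one column smaller.
print(df.shape)
print(df_clean.shape)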
Renaming Individual Columns
Next, let's rename some of the columns. %_allocated and full_time_ still contain awkward characters (a percent sign and a trailing underscore), which makes them harder to work with. We can rename them using the rename_column() method:
df_clean = (
df.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
)
df_clean.head(5)
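For comparison, the same renaming can be expressed in plain pandas; rename_column() is simply a more fluent, chainable spelling of it (a minimal sketch):
# Plain-pandas equivalent of the two rename_column() calls above,
# applied to the cleaned-but-not-yet-renamed frame.
df_renamed = (
    df.clean_names()
    .remove_empty()
    .rename(columns={"%_allocated": "percent_allocated", "full_time_": "full_time"})
)
df_renamed.head(2)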
Note how we now have really nice column names! You might be wondering why I'm not modifying the two certification columns -- that is the next thing we'll tackle.
Coalescing Columns
If we look more closely at the two certification columns, we'll see that they look like this:
df_clean[["certification", "certification_1"]]
Rows 8 and 11 have NaN in the left certification column but a value in the right one. Let's assume for a moment that the left certification column is intended to record the first certification a teacher obtained. In that case, the values in the right column on rows 8 and 11 should be moved into the left column. Let's do that with the coalesce() method, which, for each row, takes the first non-null value across the given columns:
df_clean = (
df.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
.coalesce("certification", "certification_1", target_column_name="certification")
)
df_clean
Awesome stuff! Now we don't have two columns of scattered data; we have one densely populated certification column.
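A quick check confirms that the coalesced column is now fully populated:
# After coalescing, every row should have a certification value.
print(df_clean["certification"].isna().sum())  # expected: 0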
Dealing with Excel Dates
Finally, notice how the hire_date column isn't formatted as a date; it still contains Excel's serial-number representation. To clean up this column, we can use the convert_excel_date() method.
df_clean = (
df.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time_", "full_time")
.coalesce("certification", "certification_1", target_column_name="certification")
.convert_excel_date("hire_date")
)
df_clean
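For context, Excel stores dates as serial day counts, and convert_excel_date() turns those numbers into proper datetimes. A rough plain-pandas sketch of the same idea, using a made-up serial number for illustration:
# Excel serial dates count days from the 1900 epoch; pandas conventionally
# uses 1899-12-30 as the origin to compensate for Excel's leap-year quirk.
serial = 42690  # hypothetical Excel serial number (roughly mid-November 2016)
print(pd.to_datetime(serial, unit="D", origin="1899-12-30"))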
Comparing the DataFrame Before and After pyjanitor
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 15))
mn.matrix(df, ax=ax1, sparkline=False)
mn.matrix(df_clean, ax=ax2, sparkline=False)
plt.tight_layout();
We have a cleaned dataframe!