Processing International Teacher Data
Background
This sample data comes from the UNESCO Institute of Statistics and can be found at tidytuesdays' github repo. This subset of the the data collected by the UNESCO Institute of Statistics contains country-level data on the number of teachers, teacher-to-student ratios, and related figures.
import janitor
import pandas as pd
import pandas_flavor as pf
dirty_csv = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-07/EDULIT_DS_06052019101747206.csv"
dirty_df = pd.read_csv(dirty_csv)
dirty_df.head()
Data Dictionary
Below is the data dictionary from the tidytuesday github repo.
variable | class | description |
---|---|---|
edulit_ind | character | Unique ID |
indicator | character | Education level group ("Lower Secondary Education", "Primary Education", "Upper Secondary Education", "Pre-Primary Education", "Secondary Education", "Tertiary Education", "Post-Secondary Non-Tertiary Education") |
country_code | character | Country code |
country | character | Country Full name |
year | integer (date) | Year |
student_ratio | double | Student to teacher ratio (lower = fewer students/teacher) |
flag_codes | character | Code to indicate some metadata about exceptions |
flags | character | Metadata about exceptions |
The indicator variable describles the education level for each observation. Let's evaluate the actual values of the Indicator column in the data.
# The set of unique values in the indicator column
set(dirty_df.Indicator)
Notice that strings in the dataframe each contain "Pupil-teach ratio in" & "(headcount basis)". We don't need all of this text to analyze the data.
need some custom functions to clean up the strings. We'll need a function that removes a substring, given a pattern, from values in columns. Another function that removes trailing and leading characters from a value in a column. And finally, a function to make the first letter in each string upper case.
Data Cleaning
@pf.register_dataframe_method
def str_remove(df, column_name: str, pat: str, *args, **kwargs):
"""Remove a substring, given its pattern from a string value, in a given column"""
df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)
return df
@pf.register_dataframe_method
def str_trim(df, column_name: str, *args, **kwargs):
"""Remove trailing and leading characters, in a given column"""
df[column_name] = df[column_name].str.strip(*args, **kwargs)
return df
@pf.register_dataframe_method
def str_title(df, column_name: str, *args, **kwargs):
"""Make the first letter in each word upper case"""
df[column_name] = df[column_name].str.title(*args, **kwargs)
return df
@pf.register_dataframe_method
def drop_duplicated_column(df, column_name: str, column_order: int = 0):
"""Remove duplicated columns and retain only a column given its order.
Order 0 is to remove the first column, Order 1 is to remove the second column, and etc
"""
cols = list(df.columns)
col_indexes = [
col_idx for col_idx, col_name in enumerate(cols) if col_name == column_name
]
# given that a column could be duplicated, user could opt based on its order
removed_col_idx = col_indexes[column_order]
# get the column indexes without column that is being removed
filtered_cols = [c_i for c_i, c_v in enumerate(cols) if c_i != removed_col_idx]
return df.iloc[:, filtered_cols]
Note in the next example how we are able to chain our manipulations of the data into one process without losing our ability to explain what we are doing. The is the preferred framework for using pyjanitor
py_clean_df = (
dirty_df.clean_names()
# modify string values
.str_remove("indicator", "Pupil-teacher ratio in")
.str_remove("indicator", "(headcount basis)")
.str_remove("indicator", "\\(\\)")
.str_trim("indicator")
.str_trim("country")
.str_title("indicator")
# remove `time` column (which is duplicated). The second `time` is being removed
.drop_duplicated_column("time", 1)
# renaming columns
.rename_column("location", "country_code")
.rename_column("value", "student_ratio")
.rename_column("time", "year")
)
py_clean_df.head()
# ensure that the output from janitor is similar with the clean r's janitor
r_clean_csv = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-07/student_teacher_ratio.csv"
r_clean_df = pd.read_csv(r_clean_csv)
pd.testing.assert_frame_equal(r_clean_df, py_clean_df)