Processing International Teacher Data

Background

This sample data comes from the UNESCO Institute of Statistics and can be found at tidytuesdays' github repo. This subset of the the data collected by the UNESCO Institute of Statistics contains country-level data on the number of teachers, teacher-to-student ratios, and related figures.

import janitor
import pandas as pd
import pandas_flavor as pf

dirty_csv = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-07/EDULIT_DS_06052019101747206.csv"
dirty_df = pd.read_csv(dirty_csv)
dirty_df.head()

	EDULIT_IND	Indicator	LOCATION	Country	TIME	Time	Value	Flag Codes	Flags
0	PTRHC_2	Pupil-teacher ratio in lower secondary educati...	MRT	Mauritania	2013	2013	56.59395	NaN	NaN
1	PTRHC_2	Pupil-teacher ratio in lower secondary educati...	MRT	Mauritania	2014	2014	51.94690	NaN	NaN
2	PTRHC_2	Pupil-teacher ratio in lower secondary educati...	MRT	Mauritania	2015	2015	53.22717	NaN	NaN
3	PTRHC_2	Pupil-teacher ratio in lower secondary educati...	MRT	Mauritania	2016	2016	38.18923	NaN	NaN
4	PTRHC_1	Pupil-teacher ratio in primary education (head...	COD	Democratic Republic of the Congo	2012	2012	34.74758	NaN	NaN

Data Dictionary

Below is the data dictionary from the tidytuesday github repo.

variable	class	description
edulit_ind	character	Unique ID
indicator	character	Education level group ("Lower Secondary Education", "Primary Education", "Upper Secondary Education", "Pre-Primary Education", "Secondary Education", "Tertiary Education", "Post-Secondary Non-Tertiary Education")
country_code	character	Country code
country	character	Country Full name
year	integer (date)	Year
student_ratio	double	Student to teacher ratio (lower = fewer students/teacher)
flag_codes	character	Code to indicate some metadata about exceptions
flags	character	Metadata about exceptions

The indicator variable describles the education level for each observation. Let's evaluate the actual values of the Indicator column in the data.

# The set of unique values in the indicator column
set(dirty_df.Indicator)

{'Pupil-teacher ratio in lower secondary education (headcount basis)',
 'Pupil-teacher ratio in post-secondary non-tertiary education (headcount basis)',
 'Pupil-teacher ratio in pre-primary education (headcount basis)',
 'Pupil-teacher ratio in primary education (headcount basis)',
 'Pupil-teacher ratio in secondary education (headcount basis)',
 'Pupil-teacher ratio in tertiary education (headcount basis)',
 'Pupil-teacher ratio in upper secondary education (headcount basis)'}

Notice that strings in the dataframe each contain "Pupil-teach ratio in" & "(headcount basis)". We don't need all of this text to analyze the data.

need some custom functions to clean up the strings. We'll need a function that removes a substring, given a pattern, from values in columns. Another function that removes trailing and leading characters from a value in a column. And finally, a function to make the first letter in each string upper case.

Data Cleaning

@pf.register_dataframe_method
def str_remove(df, column_name: str, pat: str, *args, **kwargs):
    """Remove a substring, given its pattern from a string value, in a given column"""
    df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)
    return df


@pf.register_dataframe_method
def str_trim(df, column_name: str, *args, **kwargs):
    """Remove trailing and leading characters, in a given column"""
    df[column_name] = df[column_name].str.strip(*args, **kwargs)
    return df


@pf.register_dataframe_method
def str_title(df, column_name: str, *args, **kwargs):
    """Make the first letter in each word upper case"""
    df[column_name] = df[column_name].str.title(*args, **kwargs)
    return df


@pf.register_dataframe_method
def drop_duplicated_column(df, column_name: str, column_order: int = 0):
    """Remove duplicated columns and retain only a column given its order.
    Order 0 is to remove the first column, Order 1 is to remove the second column, and etc
    """

    cols = list(df.columns)
    col_indexes = [
        col_idx for col_idx, col_name in enumerate(cols) if col_name == column_name
    ]

    # given that a column could be duplicated, user could opt based on its order
    removed_col_idx = col_indexes[column_order]
    # get the column indexes without column that is being removed
    filtered_cols = [c_i for c_i, c_v in enumerate(cols) if c_i != removed_col_idx]
    return df.iloc[:, filtered_cols]

Note in the next example how we are able to chain our manipulations of the data into one process without losing our ability to explain what we are doing. The is the preferred framework for using pyjanitor

py_clean_df = (
    dirty_df.clean_names()
    # modify string values
    .str_remove("indicator", "Pupil-teacher ratio in")
    .str_remove("indicator", "(headcount basis)")
    .str_remove("indicator", "\\(\\)")
    .str_trim("indicator")
    .str_trim("country")
    .str_title("indicator")
    # remove `time` column (which is duplicated). The second `time` is being removed
    .drop_duplicated_column("time", 1)
    # renaming columns
    .rename_column("location", "country_code")
    .rename_column("value", "student_ratio")
    .rename_column("time", "year")
)

py_clean_df.head()

/tmp/ipykernel_2019/4179283386.py:4: FutureWarning: The default value of regex will change from True to False in a future version.
  df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)
/tmp/ipykernel_2019/4179283386.py:4: FutureWarning: The default value of regex will change from True to False in a future version.
  df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)

	edulit_ind	indicator	country_code	country	year	student_ratio	flag_codes	flags
0	PTRHC_2	Lower Secondary Education	MRT	Mauritania	2013	56.59395	NaN	NaN
1	PTRHC_2	Lower Secondary Education	MRT	Mauritania	2014	51.94690	NaN	NaN
2	PTRHC_2	Lower Secondary Education	MRT	Mauritania	2015	53.22717	NaN	NaN
3	PTRHC_2	Lower Secondary Education	MRT	Mauritania	2016	38.18923	NaN	NaN
4	PTRHC_1	Primary Education	COD	Democratic Republic of the Congo	2012	34.74758	NaN	NaN

# ensure that the output from janitor is similar with the clean r's janitor
r_clean_csv = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-07/student_teacher_ratio.csv"
r_clean_df = pd.read_csv(r_clean_csv)

pd.testing.assert_frame_equal(r_clean_df, py_clean_df)