Processing French train data
Background
The SNCF (National Society of French Railways) is France's national state-owned railway company. Founded in 1938, it operates the country's national rail traffic along with Monaco, including the TGV, France's high-speed rail network. This dataset covers 2015-2018 with many different train stations. The dataset primarily covers aggregate trip times, delay times, cause for delay, etc., for each station there are 27 columns in total. A TGV route map can be seen here.
The data
The source data set is available from the SNCF. Check out this visualization of it. This has been used in a tidy tuesday previously. The full data set is available but we will work with a subset.
Preliminaries
import os
from collections import Counter
import janitor
import pandas as pd
import seaborn as sns
# allow plots to appear directly in the notebook
%matplotlib inline
Call chaining example
First, we run all the methods using pyjanitor's preferred call chaining approach. This code updates the column names, removes any empty rows/columns, and drops some unneeded columns in a very readable manner.
chained_df = (
pd.read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-26/small_trains.csv"
) # ingest raw data
.clean_names() # removes whitespace, punctuation/symbols, capitalization
.remove_empty() # removes entirely empty rows / columns
.rename_column("num_late_at_departure", "num_departing_late") # renames 1 column
.drop(
columns=["service", "delay_cause", "delayed_number"]
) # drops 3 unnecessary columns
# add 2 new columns with a calculation
.join_apply(
lambda df: df.num_departing_late / df.total_num_trips, "prop_late_departures"
)
.join_apply(
lambda df: df.num_arriving_late / df.total_num_trips, "prop_late_arrivals"
)
)
chained_df.head(3)
Step by step through the methods
Now, we will import the French data again and then use the methods from the call chain one at a time. Our subset of the train data has over 32000 rows and 13 columns.
df = pd.read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-26/small_trains.csv"
)
df.shape
Cleaning column names
The clean_names method converts the column names to lowercase and replaces all spaces with underscores. For this data set, it actually does not modify any of the names.
original_columns = df.columns
df = df.clean_names()
new_columns = df.columns
original_columns == new_columns
new_columns
Renaming columns
We rename the "num_late_at_departure" column for consistency purposes with the rename_column method.
df = df.rename_column("num_late_at_departure", "num_departing_late")
df.columns
Dropping empty columns and rows
The remove_empty method looks for empty columns and rows and drops them if found.
df.shape
df = df.remove_empty()
df.shape
Drop unneeded columns
We identify 3 columns that we decided are unnecessary for the analysis and can quickly drop them with the aptly named drop_columns method.
df = df.drop(columns=["service", "delay_cause", "delayed_number"])
df.columns
# gives us the top ten departure stations from that column
Counter(df["departure_station"]).most_common(10)
We use seaborn to quickly visualize how quickly departure and arrivals times were late versus the total number of trips for each of the over 30000 routes in the database.
sns.pairplot(
df,
x_vars=["num_departing_late", "num_arriving_late"],
y_vars="total_num_trips",
height=7,
aspect=0.7,
)
Add additional statistics as new columns
We can add columns containing additional statistics concerning the proportion of time each route is late either departing or arriving by using the add_columns method for each route.
Note the difference between how we added the two columns below and the same code in the chained_df file creation at the top of the notebook. In order to operate on the df that was in the process of being created in the call chain, we had to use join_apply with a lambda function instead of the add_columns method. Alternatively, we could have split the chain into two separate chains with the df being created in the first chain and the add_columns method being used in the second chain.
df_prop = df.add_columns(
prop_late_departures=df.num_departing_late / df.total_num_trips,
prop_late_arrivals=df.num_arriving_late / df.total_num_trips,
)
df_prop.head(3)