import janitor
import pandas as pd
import seaborn as sns
Processing Board Game Data
Background
This dataset comes from the Board Game Geek database. The site's database has more than 90,000 games, with crowd-sourced ratings. This particular subset is limited to only games with at least 50 ratings which were published between 1950 and 2016. This still leaves us with 10,532 games! For more information please check out the tidytuesday repo which is where this example was taken from.
Data Cleaning
%matplotlib inline
One-Shot
This cell demonstrates the cleaning process using the call chaining approach championed in pyjanitor
cleaned_df = (
pd.read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-12//board_games.csv" # noqa: E501
) # ingest raw data
.clean_names() # removes whitespace, punctuation/symbols, capitalization
.remove_empty() # removes entirely empty rows / columns
.drop(
columns=["image", "thumbnail", "compilation", "game_id"]
) # drops unnecessary columns
)
Multi-Step
These cells repeat the process in a step-by-step manner in order to explain it in more detail
Read in the csv
df = pd.read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-12/board_games.csv" # noqa: E501
)
df.head(3)
Remove the whitespace, punctuation/symbols, and capitalization form columns
df = df.clean_names()
df.head(3)
Remove all the empty rows and columns if present
df = df.remove_empty()
df.head(3)
Check to see whether "min_playtime" and "max_playtime" columns are equal
len(df[df["min_playtime"] != df["max_playtime"]])
Check to see what percentage of the values in the "compilation" column are not null
len(df[df["compilation"].notnull()]) / len(df)
Drop unnecessary columns
The 'compilation' column was demonstrated to have little value, the "image" and "thumbnail" columns link to images and are not a factor in this analysis. The "game_id" column can be replaced by using the index.
df = df.drop(columns=["image", "thumbnail", "compilation", "game_id"])
df.head(3)
Sample Analysis
What Categories appear most often?
df["category"].value_counts().head(10)
What is the relationship between games' player numbers, reccomended minimum age, and the game's estimated length?
sns.pairplot(
df,
x_vars=["min_age", "min_players", "min_playtime"],
y_vars="users_rated",
height=7,
aspect=0.7,
);
Preliminary analysis
Without digging into the data too much more it becomes apparent that there are some entries that were improperly entered e.g. having a minimum playtime of 60000 minutes. Otherwise we see some nice bell curves.