Normalization and Standardization
Normalization makes data more meaningful by converting absolute values into comparisons with related values. Chris Vallier has produced this demonstration of normalization using pyjanitor.
pyjanitor functions demonstrated here: rename_column, min_max_scale, and transform_columns.
import janitor
import pandas as pd
import seaborn as sns
sns.set(style="whitegrid")
Load data
We'll use a dataset with fuel efficiency in miles per gallon ("mpg"), engine displacement in cubic centimeters ("disp"), and horsepower ("hp") for a variety of car models. It's a crazy, but customary, mix of units.
csv_file = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
cars_df = pd.read_csv(csv_file)
Quantities without units are dangerous, so let's use pyjanitor's rename_column...
cars_df = cars_df.rename_column("disp", "disp_cc")
Examine raw data
cars_df.head()
Visualize
Each value makes more sense viewed in comparison to the other models. We'll use simple Seaborn bar plots.
mpg by model
cars_df = cars_df.sort_values("mpg", ascending=False)
sns.barplot(
    y="model",
    x="mpg",
    data=cars_df,
    color="b",
    orient="h",
)
displacement by model
cars_df = cars_df.sort_values("disp_cc", ascending=False)
sns.barplot(y="model", x="disp_cc", data=cars_df, color="b", orient="h")
horsepower by model
cars_df = cars_df.sort_values("hp", ascending=False)
sns.barplot(y="model", x="hp", data=cars_df, color="b", orient="h")
min-max normalization
First we'll use pyjanitor's min_max_scale to rescale the mpg, disp_cc, and hp columns in place so that each value varies from 0 to 1.
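# Assign the result back in case min_max_scale returns a new DataFrame
# rather than modifying cars_df in place.
cars_df = (
    cars_df.min_max_scale(col_name="mpg", new_max=1, new_min=0)
    .min_max_scale(col_name="disp_cc", new_max=1, new_min=0)
    .min_max_scale(col_name="hp", new_max=1, new_min=0)
)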
The ordering and relative spacing of the bars remain the same, but the horizontal axes show the new scale, and the smallest value in each column now sits at 0.
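The rescaling described above can also be written out by hand. The sketch below is plain Python, not pyjanitor's implementation; the helper name min_max and the sample values are made up for illustration. It also checks that the mapping keeps the original ranking, which is why the sorted bar charts list the models in the same sequence as before.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a column whose values are all equal")
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


# Hypothetical fuel-efficiency values, not taken from the dataset.
sample = [10.0, 15.0, 20.0, 30.0]
scaled = min_max(sample)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

# Rank the positions before and after scaling; a linear map with a
# positive slope leaves the ranking unchanged.
rank_before = sorted(range(len(sample)), key=sample.__getitem__)
rank_after = sorted(range(len(scaled)), key=scaled.__getitem__)
same_order = rank_before == rank_after
print(same_order)  # True
```

Each gap between neighbouring values is divided by the same span (here 20), so the bars keep their relative spacing while the axis runs from 0 to 1.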
mpg (min-max normalized)
cars_df = cars_df.sort_values("mpg", ascending=False)
sns.barplot(y="model", x="mpg", data=cars_df, color="b", orient="h")
displacement (min-max normalized)
cars_df = cars_df.sort_values("disp_cc", ascending=False)
sns.barplot(y="model", x="disp_cc", data=cars_df, color="b", orient="h")
horsepower (min-max normalized)
cars_df = cars_df.sort_values("hp", ascending=False)
sns.barplot(y="model", x="hp", data=cars_df, color="b", orient="h")
Standardization (z-score)
Next we'll convert to standard scores. A standard score expresses each value as the number of standard deviations it lies above or below the column mean, which shows where each model stands in relation to the others.
We'll use pyjanitor's transform_columns to apply the standard score calculation, (x - x.mean()) / x.std(), to each value in each of the columns we're evaluating.
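# Assign the result back in case transform_columns returns a new DataFrame
# rather than modifying cars_df in place.
cars_df = cars_df.transform_columns(
    ["mpg", "disp_cc", "hp"], lambda x: (x - x.mean()) / x.std(), elementwise=False
)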
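To make the standard score calculation concrete, here is a small sketch in plain Python using the standard library's statistics module. It is not how pyjanitor or pandas computes the result; the helper name standard_scores and the sample values are illustrative assumptions. It uses statistics.stdev, the sample standard deviation, which corresponds to the default behaviour of x.std() on a pandas Series (ddof=1). Because the mean and standard deviation are properties of the whole column, the function has to see all the values at once, which is what elementwise=False arranges in the call above.

```python
import statistics


def standard_scores(values):
    """Return (v - mean) / std for each value, using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # divides by n - 1, like pandas' default
    return [(v - mean) / std for v in values]


# Hypothetical values, not taken from the dataset.
sample = [2.0, 4.0, 6.0]
z = standard_scores(sample)
print(z)  # [-1.0, 0.0, 1.0]
```

After the transformation the values have mean 0 and sample standard deviation 1, so a score of 1.0 reads as "one standard deviation above the average model".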
Standardized mpg
cars_df = cars_df.sort_values("mpg", ascending=False)
sns.barplot(
    y="model",
    x="mpg",
    data=cars_df,
    color="b",
    orient="h",
)
Standardized displacement
cars_df = cars_df.sort_values("disp_cc", ascending=False)
sns.barplot(y="model", x="disp_cc", data=cars_df, color="b", orient="h")
Standardized horsepower
cars_df = cars_df.sort_values("hp", ascending=False)
sns.barplot(y="model", x="hp", data=cars_df, color="b", orient="h")