Normalization and Standardization

Normalization makes data more meaningful by converting absolute values into comparisons with related values. Chris Vallier has produced this demonstration of normalization using pyjanitor.

pyjanitor functions demonstrated here: rename_column, min_max_scale, and transform_columns.

import janitor
import pandas as pd
import seaborn as sns

sns.set(style="whitegrid")

Load data

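We'll use a dataset with fuel efficiency in miles per gallon ("mpg"), engine displacement ("disp"), and horsepower ("hp") for a variety of car models. This walkthrough labels displacement as cubic centimeters, although the R documentation for mtcars lists it in cubic inches. Either way, it's a crazy, but customary, mix of units.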

csv_file = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
cars_df = pd.read_csv(csv_file)

Quantities without units are dangerous, so let's use pyjanitor's rename_column...

cars_df = cars_df.rename_column("disp", "disp_cc")
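
For comparison, the same rename can be sketched in plain pandas. This is a minimal illustration on a tiny hand-made frame, assuming only that pandas is installed; the name toy_df is hypothetical, and its two rows copy the displacement values shown for those models in the raw data.

```python
import pandas as pd

# Tiny illustrative frame; displacement values for two models from the mtcars data.
toy_df = pd.DataFrame({"model": ["Mazda RX4", "Datsun 710"], "disp": [160.0, 108.0]})

# DataFrame.rename returns a new frame; toy_df keeps its original column names.
renamed_df = toy_df.rename(columns={"disp": "disp_cc"})

print(list(renamed_df.columns))
```

The method-chaining form used in this walkthrough and the plain pandas form should produce the same column names; which to prefer is mostly a matter of style.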

Examine raw data

cars_df.head()

	model	mpg	cyl	disp_cc	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2

Visualize

Each value makes more sense viewed in comparison to the other models. We'll use simple Seaborn bar plots.

mpg by model

cars_df = cars_df.sort_values("mpg", ascending=False)
sns.barplot(
    y="model",
    x="mpg",
    data=cars_df,
    color="b",
    orient="h",
)

<AxesSubplot:xlabel='mpg', ylabel='model'>

displacement by model

cars_df = cars_df.sort_values("disp_cc", ascending=False)
sns.barplot(y="model", x="disp_cc", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='disp_cc', ylabel='model'>

horsepower by model

cars_df = cars_df.sort_values("hp", ascending=False)
sns.barplot(y="model", x="hp", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='hp', ylabel='model'>

min-max normalization

First we'll use pyjanitor's min_max_scale to rescale the mpg, disp_cc, and hp columns so that each one ranges from 0 to 1. We assign the result back to cars_df so the plots below use the rescaled values without relying on the columns being modified in place.

cars_df = (
    cars_df.min_max_scale(col_name="mpg", new_max=1, new_min=0)
    .min_max_scale(col_name="disp_cc", new_max=1, new_min=0)
    .min_max_scale(col_name="hp", new_max=1, new_min=0)
)
cars_df

	model	mpg	cyl	disp_cc	hp	drat	wt	qsec	vs	am	gear	carb
30	Maserati Bora	0.195745	8	0.573460	1.000000	3.54	3.570	14.60	0	1	5	8
28	Ford Pantera L	0.229787	8	0.698179	0.749117	4.22	3.170	14.50	0	1	5	4
6	Duster 360	0.165957	8	0.720629	0.681979	3.21	3.570	15.84	0	0	3	4
23	Camaro Z28	0.123404	8	0.695685	0.681979	3.73	3.840	15.41	0	0	3	4
16	Chrysler Imperial	0.182979	8	0.920180	0.628975	3.23	5.345	17.42	0	0	3	4
15	Lincoln Continental	0.000000	8	0.970067	0.575972	3.00	5.424	17.82	0	0	3	4
14	Cadillac Fleetwood	0.000000	8	1.000000	0.540636	2.93	5.250	17.98	0	0	3	4
13	Merc 450SLC	0.204255	8	0.510601	0.452297	3.07	3.780	18.00	0	0	3	3
11	Merc 450SE	0.255319	8	0.510601	0.452297	3.07	4.070	17.40	0	0	3	3
12	Merc 450SL	0.293617	8	0.510601	0.452297	3.07	3.730	17.60	0	0	3	3
24	Pontiac Firebird	0.374468	8	0.820404	0.434629	3.08	3.845	17.05	0	0	3	2
4	Hornet Sportabout	0.353191	8	0.720629	0.434629	3.15	3.440	17.02	0	0	3	2
29	Ferrari Dino	0.395745	6	0.184335	0.434629	3.62	2.770	15.50	0	1	5	6
21	Dodge Challenger	0.217021	8	0.615864	0.346290	2.76	3.520	16.87	0	0	3	2
22	AMC Javelin	0.204255	8	0.580943	0.346290	3.15	3.435	17.30	0	0	3	2
10	Merc 280C	0.314894	6	0.240708	0.250883	3.92	3.440	18.90	1	0	4	4
9	Merc 280	0.374468	6	0.240708	0.250883	3.92	3.440	18.30	1	0	4	4
27	Lotus Europa	0.851064	4	0.059865	0.215548	3.77	1.513	16.90	1	1	5	2
0	Mazda RX4	0.451064	6	0.221751	0.204947	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	0.451064	6	0.221751	0.204947	3.90	2.875	17.02	0	1	4	4
3	Hornet 4 Drive	0.468085	6	0.466201	0.204947	3.08	3.215	19.44	1	0	3	1
31	Volvo 142E	0.468085	4	0.124470	0.201413	4.11	2.780	18.60	1	1	4	2
5	Valiant	0.327660	6	0.383886	0.187279	2.76	3.460	20.22	1	0	3	1
20	Toyota Corona	0.472340	4	0.122225	0.159011	3.70	2.465	20.01	1	0	3	1
8	Merc 230	0.527660	4	0.173859	0.151943	3.92	3.150	22.90	1	0	4	2
2	Datsun 710	0.527660	4	0.092043	0.144876	3.85	2.320	18.61	1	1	4	1
26	Porsche 914-2	0.663830	4	0.122724	0.137809	4.43	2.140	16.70	0	1	5	2
25	Fiat X1-9	0.719149	4	0.019706	0.049470	4.08	1.935	18.90	1	1	4	1
17	Fiat 128	0.936170	4	0.018957	0.049470	4.08	2.200	19.47	1	1	4	1
19	Toyota Corolla	1.000000	4	0.000000	0.045936	4.22	1.835	19.90	1	1	4	1
7	Merc 240D	0.595745	4	0.188576	0.035336	3.69	3.190	20.00	1	0	4	2
18	Honda Civic	0.851064	4	0.011474	0.000000	4.93	1.615	18.52	1	1	4	2

The models keep the same order in each bar graph, but the horizontal axes now run from 0 (the smallest value in the column) to 1 (the largest).
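
To make the arithmetic behind the rescaling concrete, here is a minimal plain-pandas sketch of min-max scaling. The helper name min_max is hypothetical and is not part of pyjanitor; the three raw mpg values are the minimum, the Mazda RX4, and the maximum from the mtcars data, chosen so the middle result can be checked against the table above.

```python
import pandas as pd


def min_max(values: pd.Series, new_min: float = 0.0, new_max: float = 1.0) -> pd.Series:
    """Linearly rescale so the smallest value maps to new_min and the largest to new_max."""
    old_min, old_max = values.min(), values.max()
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Raw mpg for the least efficient model, the Mazda RX4, and the most efficient model.
mpg = pd.Series(
    [10.4, 21.0, 33.9],
    index=["Cadillac Fleetwood", "Mazda RX4", "Toyota Corolla"],
)
scaled = min_max(mpg)

print(scaled.round(6).tolist())  # [0.0, 0.451064, 1.0]

# The rescaling is monotonic, so sorting gives the same model order before and after.
print(list(mpg.sort_values().index) == list(scaled.sort_values().index))  # True
```

Note that this sketch divides by the column's range, so a column whose values are all identical would need separate handling.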

mpg (min-max normalized)

cars_df = cars_df.sort_values("mpg", ascending=False)
sns.barplot(y="model", x="mpg", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='mpg', ylabel='model'>

displacement (min-max normalized)

cars_df = cars_df.sort_values("disp_cc", ascending=False)
sns.barplot(y="model", x="disp_cc", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='disp_cc', ylabel='model'>

horsepower (min-max normalized)

cars_df = cars_df.sort_values("hp", ascending=False)
sns.barplot(y="model", x="hp", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='hp', ylabel='model'>

Standardization (z-score)

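Next we'll convert to standard scores. A standard score expresses each value as the number of standard deviations it lies from the column mean, which shows where each model stands in relation to the others.

We'll use pyjanitor's transform_columns to apply the standard score calculation, (x - x.mean()) / x.std(), to each of the columns we're evaluating. With elementwise=False the function receives a whole column at a time, so x.mean() and x.std() are computed per column. As before, we assign the result back to cars_df so the plots below use the standardized values.

cars_df = cars_df.transform_columns(
    ["mpg", "disp_cc", "hp"], lambda x: (x - x.mean()) / x.std(), elementwise=False
)
cars_df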

	model	mpg	cyl	disp_cc	hp	drat	wt	qsec	vs	am	gear	carb
30	Maserati Bora	-0.844644	8	0.567039	2.746567	3.54	3.570	14.60	0	1	5	8
28	Ford Pantera L	-0.711907	8	0.970465	1.711021	4.22	3.170	14.50	0	1	5	4
6	Duster 360	-0.960789	8	1.043081	1.433903	3.21	3.570	15.84	0	0	3	4
23	Camaro Z28	-1.126710	8	0.962396	1.433903	3.73	3.840	15.41	0	0	3	4
16	Chrysler Imperial	-0.894420	8	1.688562	1.215126	3.23	5.345	17.42	0	0	3	4
15	Lincoln Continental	-1.607883	8	1.849932	0.996348	3.00	5.424	17.82	0	0	3	4
14	Cadillac Fleetwood	-1.607883	8	1.946754	0.850497	2.93	5.250	17.98	0	0	3	4
13	Merc 450SLC	-0.811460	8	0.363713	0.485868	3.07	3.780	18.00	0	0	3	3
11	Merc 450SE	-0.612354	8	0.363713	0.485868	3.07	4.070	17.40	0	0	3	3
12	Merc 450SL	-0.463025	8	0.363713	0.485868	3.07	3.730	17.60	0	0	3	3
24	Pontiac Firebird	-0.147774	8	1.365821	0.412942	3.08	3.845	17.05	0	0	3	2
4	Hornet Sportabout	-0.230735	8	1.043081	0.412942	3.15	3.440	17.02	0	0	3	2
29	Ferrari Dino	-0.064813	6	-0.691647	0.412942	3.62	2.770	15.50	0	1	5	6
21	Dodge Challenger	-0.761683	8	0.704204	0.048313	2.76	3.520	16.87	0	0	3	2
22	AMC Javelin	-0.811460	8	0.591245	0.048313	3.15	3.435	17.30	0	0	3	2
10	Merc 280C	-0.380064	6	-0.509299	-0.345486	3.92	3.440	18.90	1	0	4	4
9	Merc 280	-0.147774	6	-0.509299	-0.345486	3.92	3.440	18.30	1	0	4	4
27	Lotus Europa	1.710547	4	-1.094266	-0.491337	3.77	1.513	16.90	1	1	5	2
0	Mazda RX4	0.150885	6	-0.570620	-0.535093	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	0.150885	6	-0.570620	-0.535093	3.90	2.875	17.02	0	1	4	4
3	Hornet 4 Drive	0.217253	6	0.220094	-0.535093	3.08	3.215	19.44	1	0	3	1
31	Volvo 142E	0.217253	4	-0.885292	-0.549678	4.11	2.780	18.60	1	1	4	2
5	Valiant	-0.330287	6	-0.046167	-0.608019	2.76	3.460	20.22	1	0	3	1
20	Toyota Corona	0.233846	4	-0.892553	-0.724700	3.70	2.465	20.01	1	0	3	1
8	Merc 230	0.449543	4	-0.725535	-0.753870	3.92	3.150	22.90	1	0	4	2
2	Datsun 710	0.449543	4	-0.990182	-0.783040	3.85	2.320	18.61	1	1	4	1
26	Porsche 914-2	0.980492	4	-0.890939	-0.812211	4.43	2.140	16.70	0	1	5	2
25	Fiat X1-9	1.196190	4	-1.224169	-1.176840	4.08	1.935	18.90	1	1	4	1
17	Fiat 128	2.042389	4	-1.226589	-1.176840	4.08	2.200	19.47	1	1	4	1
19	Toyota Corolla	2.291272	4	-1.287910	-1.191425	4.22	1.835	19.90	1	1	4	1
7	Merc 240D	0.715018	4	-0.677931	-1.235180	3.69	3.190	20.00	1	0	4	2
18	Honda Civic	1.710547	4	-1.250795	-1.381032	4.93	1.615	18.52	1	1	4	2
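
The standard score calculation can also be sketched in plain pandas. The example below is an illustration under stated assumptions rather than pyjanitor's implementation: the helper name standard_score and the toy values are hypothetical. It also checks a point that matters in this walkthrough, where the columns had already been min-max scaled: standardizing scaled values gives the same scores as standardizing the raw values.

```python
import pandas as pd


def standard_score(column: pd.Series) -> pd.Series:
    """Return (x - mean) / std for a whole column; pandas' std() is the sample std (ddof=1)."""
    return (column - column.mean()) / column.std()


# Toy data (hypothetical values, not taken from mtcars).
toy_df = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 20.0, 60.0]})

# DataFrame.apply passes each column to the function as a Series.
z_raw = toy_df.apply(standard_score)

# Min-max scale first, then standardize.
scaled = (toy_df - toy_df.min()) / (toy_df.max() - toy_df.min())
z_scaled = scaled.apply(standard_score)

print(z_raw["a"].tolist())  # [-1.0, 0.0, 1.0]

# The two results agree up to floating-point rounding.
print(bool(((z_raw - z_scaled).abs() < 1e-9).all().all()))  # True
```

Because pandas' Series.std() defaults to the sample standard deviation, results will differ slightly from tools that default to the population standard deviation.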

Standardized mpg

cars_df = cars_df.sort_values("mpg", ascending=False)
sns.barplot(
    y="model",
    x="mpg",
    data=cars_df,
    color="b",
    orient="h",
)

<AxesSubplot:xlabel='mpg', ylabel='model'>

Standardized displacement

cars_df = cars_df.sort_values("disp_cc", ascending=False)
sns.barplot(y="model", x="disp_cc", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='disp_cc', ylabel='model'>

Standardized horsepower

cars_df = cars_df.sort_values("hp", ascending=False)
sns.barplot(y="model", x="hp", data=cars_df, color="b", orient="h")

<AxesSubplot:xlabel='hp', ylabel='model'>