Replacing Bad Values

This is US wind turbine data. The numeric fields use -9999 as a null value for missing data. Using -9999 as a null value in numeric fields will cause big problems for any summary statistics like totals, means, etc, we should change that to something else, like np.NaN which Pandas sum and mean functions will automatically filter out.
You can see that the means for before and after replacing -9999 with np.NaN are very different. You can use Janitor's find_replace to easily replace them.

import janitor
import numpy as np
import pandas as pd

Load Wind Turbine Data

wind = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-06/us_wind.csv"
)
wind.head()

	case_id	faa_ors	faa_asn	usgs_pr_id	t_state	t_county	t_fips	p_name	p_year	p_tnum	...	t_hh	t_rd	t_rsa	t_ttlh	t_conf_atr	t_conf_loc	t_img_date	t_img_srce	xlong	ylat
0	3073429	missing	missing	4960	CA	Kern County	6029	251 Wind	1987	194	...	-9999.0	-9999.0	-9999.0	-9999.0	2	3	1/1/2012	NAIP	-118.360725	35.083778
1	3071522	missing	missing	4997	CA	Kern County	6029	251 Wind	1987	194	...	-9999.0	-9999.0	-9999.0	-9999.0	2	3	1/1/2012	NAIP	-118.361168	35.081512
2	3073425	missing	missing	4957	CA	Kern County	6029	251 Wind	1987	194	...	-9999.0	-9999.0	-9999.0	-9999.0	2	3	1/1/2012	NAIP	-118.360420	35.084709
3	3071569	missing	missing	5023	CA	Kern County	6029	251 Wind	1987	194	...	-9999.0	-9999.0	-9999.0	-9999.0	2	3	7/31/2016	Digital Globe	-118.364029	35.079418
4	3005252	missing	missing	5768	CA	Kern County	6029	251 Wind	1987	194	...	-9999.0	-9999.0	-9999.0	-9999.0	2	3	11/23/2017	Digital Globe	-118.354286	35.085594

5 rows × 24 columns

Check Mean

wind.t_hh.mean()

-1069.986537767466

The t_hh column appears to be affected by -9999 values. What are all the columns that are affected?

[col for col in wind.columns if -9999 in wind[col].values]

['usgs_pr_id', 'p_year', 'p_cap', 't_cap', 't_hh', 't_rd', 't_rsa', 't_ttlh']

Note: When replacing the -9999 values you can make a copy of the dataframe to prevent making modifications to the original dataframe.

At first glance, it looks like the mean is negative, but this is only because of the bad values (-9999.0) throughout the column. To get the right mean, we should replace them!## Replace Bad Values with NaNs

mapping = {-9999.0: np.nan}
wind2 = wind.find_replace(
    usgs_pr_id=mapping,
    p_tnum=mapping,
    p_cap=mapping,
    t_cap=mapping,
    t_hh=mapping,
    t_rd=mapping,
    t_rsa=mapping,
    t_ttlh=mapping,
)
wind2.head()

	case_id	faa_ors	faa_asn	usgs_pr_id	t_state	t_county	t_fips	p_name	p_year	p_tnum	...	t_hh	t_rd	t_rsa	t_ttlh	t_conf_atr	t_conf_loc	t_img_date	t_img_srce	xlong	ylat
0	3073429	missing	missing	4960.0	CA	Kern County	6029	251 Wind	1987	194	...	NaN	NaN	NaN	NaN	2	3	1/1/2012	NAIP	-118.360725	35.083778
1	3071522	missing	missing	4997.0	CA	Kern County	6029	251 Wind	1987	194	...	NaN	NaN	NaN	NaN	2	3	1/1/2012	NAIP	-118.361168	35.081512
2	3073425	missing	missing	4957.0	CA	Kern County	6029	251 Wind	1987	194	...	NaN	NaN	NaN	NaN	2	3	1/1/2012	NAIP	-118.360420	35.084709
3	3071569	missing	missing	5023.0	CA	Kern County	6029	251 Wind	1987	194	...	NaN	NaN	NaN	NaN	2	3	7/31/2016	Digital Globe	-118.364029	35.079418
4	3005252	missing	missing	5768.0	CA	Kern County	6029	251 Wind	1987	194	...	NaN	NaN	NaN	NaN	2	3	11/23/2017	Digital Globe	-118.354286	35.085594

5 rows × 24 columns

Check the Mean (again)

wind2.t_hh.mean()

77.31203064391

And, now that the bad values were replaced by NaNs (which the mean() method ignores), the calculated mean is correct!

Alternate method

If we look at the description of the data (see README) we can find descriptions for our data values, for example:

p_year: Year project became operational
t_hh: Turbine hub height (meters)
xlong: Longitude

Using our knowledge of the data, this would give us bounds we could use for these values. For example, the oldest electric wind turbine was built in 1887 and this document was written in 2018, so $1887 \leq \text{p_year} \leq 2018$. Turbine hub height should be positive, and a value above 1 km would be silly, so $0 < \text{t_hh} < 1000$. These are wind turbines in the United States, so $-161.76 < \text{xlong} < -68.01$.

(Note that the README actually gives us minima and maxima for the data, so we could get much tighter bounds from that.)

To filter out potential bad values, we will use update_where to remove values outside these ranges.

# Note that update_where mutates the original dataframe
(
    wind.update_where(
        (wind["p_year"] < 1887) | (wind["p_year"] > 2018), "p_year", np.nan
    )
    .update_where((wind["t_hh"] <= 0) | (wind["t_hh"] >= 1000), "t_hh", np.nan)
    .update_where((wind["xlong"] < -161.76) | (wind["xlong"] > -68.01), "xlong", np.nan)
);

Confirming this produces the same result

wind.t_hh.mean()

77.31203064391