# Replacing Bad Values
This is US wind turbine data. The numeric fields use -9999 as a sentinel value for missing data.
Leaving -9999 in numeric fields will badly skew any summary statistics such as totals and means, so we should replace it with something else, like np.nan, which pandas' sum and mean methods automatically exclude.
You can see that the means before and after replacing -9999 with np.nan are very different.
You can use pyjanitor's find_replace to replace them easily.
```python
# Importing janitor registers its methods on pandas DataFrames
import janitor
import numpy as np
import pandas as pd
```
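As a quick illustration (toy data, using the imports above), a single -9999 sentinel drags the mean far off, while np.nan is simply skipped:

```python
# One sentinel skews the mean; NaN is excluded from it
s = pd.Series([10.0, 20.0, -9999.0])
print(s.mean())                           # -3323.0
print(s.replace(-9999.0, np.nan).mean())  # 15.0
```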
## Load Wind Turbine Data
```python
wind = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-06/us_wind.csv"
)
wind.head()
```
## Check Mean
```python
wind.t_hh.mean()
```
The t_hh column appears to be affected by -9999 values. What are all the columns that are affected?
```python
# List every column that contains the -9999 sentinel
[col for col in wind.columns if -9999 in wind[col].values]
```
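An equivalent vectorized check (a sketch; the `affected` name is ours):

```python
# Compare every cell to the sentinel, then keep columns where any match
affected = wind.columns[(wind == -9999).any()].tolist()
affected
```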
Note: Before replacing the -9999 values, you can make a copy of the dataframe to avoid modifying the original.
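For example, a minimal sketch (the `wind_copy` name is ours):

```python
# Keep an untouched copy of the original dataframe
wind_copy = wind.copy()
```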
At first glance, it looks like the mean is negative, but this is only because of the bad values (-9999.0) throughout the column. To get the right mean, we should replace them!

## Replace Bad Values with NaNs
```python
# Map the -9999.0 sentinel to NaN in each affected column
mapping = {-9999.0: np.nan}
wind2 = wind.find_replace(
    usgs_pr_id=mapping,
    p_tnum=mapping,
    p_cap=mapping,
    t_cap=mapping,
    t_hh=mapping,
    t_rd=mapping,
    t_rsa=mapping,
    t_ttlh=mapping,
)
wind2.head()
```
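Since every affected column uses the same mapping, you could also build the keyword arguments programmatically; a sketch assuming the same column list as above:

```python
# Apply one mapping to all affected columns without repeating it
affected_cols = [
    "usgs_pr_id", "p_tnum", "p_cap", "t_cap",
    "t_hh", "t_rd", "t_rsa", "t_ttlh",
]
wind2 = wind.find_replace(**{col: mapping for col in affected_cols})
```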
## Check the Mean (again)
```python
wind2.t_hh.mean()
```
Now that the bad values have been replaced by NaNs (which the mean() method ignores), the calculated mean is correct!
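If you want to see how many sentinel values were replaced, count the resulting NaNs:

```python
# Number of t_hh entries that are now missing
wind2.t_hh.isna().sum()
```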
## Alternate method
If we look at the description of the data (see the README), we can find descriptions of our data values, for example:
p_year
: Year project became operational

t_hh
: Turbine hub height (meters)

xlong
: Longitude
Using our knowledge of the data, this would give us bounds we could use for these values. For example, the oldest electric wind turbine was built in 1887 and this document was written in 2018, so $1887 \leq \text{p\_year} \leq 2018$. Turbine hub height should be positive, and a value above 1 km would be silly, so $0 < \text{t\_hh} < 1000$. These are wind turbines in the United States, so $-161.76 < \text{xlong} < -68.01$.
(Note that the README actually gives us minima and maxima for the data, so we could get much tighter bounds from that.)
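As a quick sanity check of these hand-picked bounds, we can inspect the observed ranges once the sentinels are treated as missing (a sketch; replace returns a new frame, so wind is unchanged):

```python
# Observed min/max per column, ignoring the -9999 sentinels
wind[["p_year", "t_hh", "xlong"]].replace(-9999, np.nan).agg(["min", "max"])
```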
To filter out potential bad values, we will use update_where to replace values outside these ranges with NaN.
```python
# Note that update_where mutates the original dataframe
(
    wind.update_where(
        (wind["p_year"] < 1887) | (wind["p_year"] > 2018), "p_year", np.nan
    )
    .update_where((wind["t_hh"] <= 0) | (wind["t_hh"] >= 1000), "t_hh", np.nan)
    .update_where(
        (wind["xlong"] < -161.76) | (wind["xlong"] > -68.01), "xlong", np.nan
    )
);
```
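For comparison, the same filtering can be written in plain pandas with Series.mask (an equivalent sketch, shown for p_year only):

```python
# Plain-pandas equivalent: out-of-range values become NaN
wind["p_year"] = wind["p_year"].mask(
    (wind["p_year"] < 1887) | (wind["p_year"] > 2018)
)
```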
## Confirming this produces the same result
```python
wind.t_hh.mean()
```
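Both approaches map the same bad t_hh values to NaN, so the means should agree; a quick check:

```python
# The two approaches should yield the same mean hub height
assert np.isclose(wind.t_hh.mean(), wind2.t_hh.mean())
```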