Skip to content

Binder

Replacing Bad Values

This is US wind turbine data. The numeric fields use -9999 as a null value for missing data. Using -9999 as a null value in numeric fields will cause big problems for any summary statistics like totals, means, etc, we should change that to something else, like np.NaN which Pandas sum and mean functions will automatically filter out.
You can see that the means for before and after replacing -9999 with np.NaN are very different. You can use Janitor's find_replace to easily replace them.

import janitor
import numpy as np
import pandas as pd

Load Wind Turbine Data

wind = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-06/us_wind.csv"
)
wind.head()
case_id faa_ors faa_asn usgs_pr_id t_state t_county t_fips p_name p_year p_tnum ... t_hh t_rd t_rsa t_ttlh t_conf_atr t_conf_loc t_img_date t_img_srce xlong ylat
0 3073429 missing missing 4960 CA Kern County 6029 251 Wind 1987 194 ... -9999.0 -9999.0 -9999.0 -9999.0 2 3 1/1/2012 NAIP -118.360725 35.083778
1 3071522 missing missing 4997 CA Kern County 6029 251 Wind 1987 194 ... -9999.0 -9999.0 -9999.0 -9999.0 2 3 1/1/2012 NAIP -118.361168 35.081512
2 3073425 missing missing 4957 CA Kern County 6029 251 Wind 1987 194 ... -9999.0 -9999.0 -9999.0 -9999.0 2 3 1/1/2012 NAIP -118.360420 35.084709
3 3071569 missing missing 5023 CA Kern County 6029 251 Wind 1987 194 ... -9999.0 -9999.0 -9999.0 -9999.0 2 3 7/31/2016 Digital Globe -118.364029 35.079418
4 3005252 missing missing 5768 CA Kern County 6029 251 Wind 1987 194 ... -9999.0 -9999.0 -9999.0 -9999.0 2 3 11/23/2017 Digital Globe -118.354286 35.085594

5 rows × 24 columns

Check Mean

wind.t_hh.mean()
-1069.986537767466

The t_hh column appears to be affected by -9999 values. What are all the columns that are affected?

[col for col in wind.columns if -9999 in wind[col].values]
['usgs_pr_id', 'p_year', 'p_cap', 't_cap', 't_hh', 't_rd', 't_rsa', 't_ttlh']

Note: When replacing the -9999 values you can make a copy of the dataframe to prevent making modifications to the original dataframe.

At first glance, it looks like the mean is negative, but this is only because of the bad values (-9999.0) throughout the column. To get the right mean, we should replace them!## Replace Bad Values with NaNs

mapping = {-9999.0: np.nan}
wind2 = wind.find_replace(
    usgs_pr_id=mapping,
    p_tnum=mapping,
    p_cap=mapping,
    t_cap=mapping,
    t_hh=mapping,
    t_rd=mapping,
    t_rsa=mapping,
    t_ttlh=mapping,
)
wind2.head()
case_id faa_ors faa_asn usgs_pr_id t_state t_county t_fips p_name p_year p_tnum ... t_hh t_rd t_rsa t_ttlh t_conf_atr t_conf_loc t_img_date t_img_srce xlong ylat
0 3073429 missing missing 4960.0 CA Kern County 6029 251 Wind 1987 194 ... NaN NaN NaN NaN 2 3 1/1/2012 NAIP -118.360725 35.083778
1 3071522 missing missing 4997.0 CA Kern County 6029 251 Wind 1987 194 ... NaN NaN NaN NaN 2 3 1/1/2012 NAIP -118.361168 35.081512
2 3073425 missing missing 4957.0 CA Kern County 6029 251 Wind 1987 194 ... NaN NaN NaN NaN 2 3 1/1/2012 NAIP -118.360420 35.084709
3 3071569 missing missing 5023.0 CA Kern County 6029 251 Wind 1987 194 ... NaN NaN NaN NaN 2 3 7/31/2016 Digital Globe -118.364029 35.079418
4 3005252 missing missing 5768.0 CA Kern County 6029 251 Wind 1987 194 ... NaN NaN NaN NaN 2 3 11/23/2017 Digital Globe -118.354286 35.085594

5 rows × 24 columns

Check the Mean (again)

wind2.t_hh.mean()
77.31203064391

And, now that the bad values were replaced by NaNs (which the mean() method ignores), the calculated mean is correct!

Alternate method

If we look at the description of the data (see README) we can find descriptions for our data values, for example:

  • p_year: Year project became operational
  • t_hh: Turbine hub height (meters)
  • xlong: Longitude

Using our knowledge of the data, this would give us bounds we could use for these values. For example, the oldest electric wind turbine was built in 1887 and this document was written in 2018, so $1887 \leq \text{p_year} \leq 2018$. Turbine hub height should be positive, and a value above 1 km would be silly, so $0 < \text{t_hh} < 1000$. These are wind turbines in the United States, so $-161.76 < \text{xlong} < -68.01$.

(Note that the README actually gives us minima and maxima for the data, so we could get much tighter bounds from that.)

To filter out potential bad values, we will use update_where to remove values outside these ranges.

# Note that update_where mutates the original dataframe
(
    wind.update_where(
        (wind["p_year"] < 1887) | (wind["p_year"] > 2018), "p_year", np.nan
    )
    .update_where((wind["t_hh"] <= 0) | (wind["t_hh"] >= 1000), "t_hh", np.nan)
    .update_where((wind["xlong"] < -161.76) | (wind["xlong"] > -68.01), "xlong", np.nan)
);

Confirming this produces the same result

wind.t_hh.mean()
77.31203064391