Expose explicitly missing values with complete

import janitor
import numpy as np
import pandas as pd
# from http://imachordata.com/2016/02/05/you-complete-me/
df = pd.DataFrame(
    {
        "Year": [1999, 2000, 2004, 1999, 2004],
        "Taxon": [
            "Saccharina",
            "Saccharina",
            "Saccharina",
            "Agarum",
            "Agarum",
        ],
        "Abundance": [4, 5, 2, 1, 8],
    }
)

df
Year Taxon Abundance
0 1999 Saccharina 4
1 2000 Saccharina 5
2 2004 Saccharina 2
3 1999 Agarum 1
4 2004 Agarum 8

Note that the pairing of Year 2000 with Agarum is missing from the DataFrame above. Let’s make it explicit:

df.complete("Year", "Taxon")
Year Taxon Abundance
0 1999 Saccharina 4.0
1 2000 Saccharina 5.0
2 2004 Saccharina 2.0
3 1999 Agarum 1.0
4 2004 Agarum 8.0
5 2000 Agarum NaN
# Sort the result so the rows are easier to read
df.complete("Year", "Taxon", sort=True)
Year Taxon Abundance
0 1999 Agarum 1.0
1 1999 Saccharina 4.0
2 2000 Agarum NaN
3 2000 Saccharina 5.0
4 2004 Agarum 8.0
5 2004 Saccharina 2.0
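
For readers who want to see what this amounts to in plain pandas, the following is a minimal sketch, not pyjanitor's actual implementation. It assumes the (Year, Taxon) pairs in the data are unique, builds every pairing with `MultiIndex.from_product`, and reindexes against it. The names `all_pairs` and `completed` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Year": [1999, 2000, 2004, 1999, 2004],
        "Taxon": ["Saccharina", "Saccharina", "Saccharina", "Agarum", "Agarum"],
        "Abundance": [4, 5, 2, 1, 8],
    }
)

# Every (Year, Taxon) pair that can be formed from the values already present.
all_pairs = pd.MultiIndex.from_product(
    [df["Year"].unique(), df["Taxon"].unique()], names=["Year", "Taxon"]
)

# Reindexing against the full grid adds the absent pair with NaN in Abundance.
# Assumption: the (Year, Taxon) pairs in df are unique, otherwise reindex fails.
completed = (
    df.set_index(["Year", "Taxon"])
    .reindex(all_pairs)
    .sort_index()
    .reset_index()
)
print(completed)
```

The NaN in the new row is also why Abundance changes from integer to float: a plain integer column cannot hold a missing value.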

What if we want to expose the missing values for every year from 1999 to 2004? Pass a dictionary that maps the column name to the new values:

new_year_values = {"Year": range(df.Year.min(), df.Year.max() + 1)}

df.complete(new_year_values, "Taxon")
Year Taxon Abundance
0 1999 Saccharina 4.0
1 2000 Saccharina 5.0
2 2004 Saccharina 2.0
3 1999 Agarum 1.0
4 2004 Agarum 8.0
5 2000 Agarum NaN
6 2001 Saccharina NaN
7 2001 Agarum NaN
8 2002 Saccharina NaN
9 2002 Agarum NaN
10 2003 Saccharina NaN
11 2003 Agarum NaN
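
The same idea can be sketched in plain pandas as a left join against a grid of all wanted combinations. This is an illustrative equivalent under the assumption that a join reproduces the result above, not a description of how complete works internally. The names `years`, `grid`, and `completed` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Year": [1999, 2000, 2004, 1999, 2004],
        "Taxon": ["Saccharina", "Saccharina", "Saccharina", "Agarum", "Agarum"],
        "Abundance": [4, 5, 2, 1, 8],
    }
)

# All years from the earliest to the latest observed year, inclusive.
years = range(df["Year"].min(), df["Year"].max() + 1)

# One row per (Year, Taxon) combination we want to see in the result.
grid = pd.MultiIndex.from_product(
    [years, df["Taxon"].unique()], names=["Year", "Taxon"]
).to_frame(index=False)

# A left join keeps every grid row; combinations absent from df get NaN.
completed = grid.merge(df, on=["Year", "Taxon"], how="left")
print(completed)
```

Six years times two taxa gives twelve rows, of which only the five original observations carry an Abundance value.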

You can also pass a callable as a value in the dictionary:

def new_year_values(year):
    return range(year.min(), year.max() + 1)


df.complete({"Year": new_year_values}, "Taxon", sort=True)
Year Taxon Abundance
0 1999 Agarum 1.0
1 1999 Saccharina 4.0
2 2000 Agarum NaN
3 2000 Saccharina 5.0
4 2001 Agarum NaN
5 2001 Saccharina NaN
6 2002 Agarum NaN
7 2002 Saccharina NaN
8 2003 Agarum NaN
9 2003 Saccharina NaN
10 2004 Agarum 8.0
11 2004 Saccharina 2.0
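
To see what the callable contributes, it can be called directly on the Year column. This sketch assumes, as the function body suggests, that the callable is handed the column as a pandas Series. How complete invokes it internally is not shown here.

```python
import pandas as pd

# Stand-in for df["Year"] from the example above.
years = pd.Series([1999, 2000, 2004, 1999, 2004], name="Year")


def new_year_values(year):
    # Every year between the smallest and largest observed year, inclusive.
    return range(year.min(), year.max() + 1)


expanded = list(new_year_values(years))
print(expanded)  # [1999, 2000, 2001, 2002, 2003, 2004]
```

Compared with a precomputed range, a callable lets the new values depend on whatever data it receives.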

You can also expose missing rows based only on combinations that already exist in the data:

# https://stackoverflow.com/q/62266057/7175713
df = {
    "Name": ("Bob", "Bob", "Emma"),
    "Age": (23, 23, 78),
    "Gender": ("Male", "Male", "Female"),
    "Item": ("house", "car", "house"),
    "Value": (5, 1, 3),
}
df = pd.DataFrame(df)
df
Name Age Gender Item Value
0 Bob 23 Male house 5
1 Bob 23 Male car 1
2 Emma 78 Female house 3

In the DataFrame above, the (Name, Age, Gender) combination (Emma, 78, Female) has no row with the Item value car. Pass the tuple (Name, Age, Gender) together with Item to expose the missing row:

df.complete(("Name", "Age", "Gender"), "Item")
Name Age Gender Item Value
0 Bob 23 Male house 5.0
1 Bob 23 Male car 1.0
2 Emma 78 Female house 3.0
3 Emma 78 Female car NaN
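
A plain pandas sketch makes the role of the tuple clearer: the grouped columns contribute only the combinations that actually occur, and those are then crossed with the distinct Item values. This is an illustrative equivalent, not pyjanitor's implementation. The names `people`, `items`, `grid`, and `completed` are hypothetical, and `merge(how="cross")` assumes pandas 1.2 or later.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ("Bob", "Bob", "Emma"),
        "Age": (23, 23, 78),
        "Gender": ("Male", "Male", "Female"),
        "Item": ("house", "car", "house"),
        "Value": (5, 1, 3),
    }
)

# Only the (Name, Age, Gender) combinations that exist: Bob and Emma.
people = df[["Name", "Age", "Gender"]].drop_duplicates()
items = df[["Item"]].drop_duplicates()

# Cross the existing people with the distinct items: 2 x 2 = 4 rows.
grid = people.merge(items, how="cross")

# The left join keeps all four rows; Emma's car row has no Value.
completed = grid.merge(df, on=["Name", "Age", "Gender", "Item"], how="left")
print(completed)
```

Had Name, Age, and Gender been passed as separate columns, every value would be crossed with every other, giving 2 x 2 x 2 x 2 = 16 rows, including impossible ones such as (Bob, 78, Female).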

The example above exposed missing rows by treating several columns as a single group. You can also expose missing rows separately within each group of the data by using the by parameter:

df = pd.DataFrame(
    {
        "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],
        "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],
        "value": [1, 3, 1, 2, 3, 2, 5],
    }
)

df
state year value
0 CA 2010 1
1 CA 2013 3
2 HI 2010 1
3 HI 2012 2
4 HI 2016 3
5 NY 2009 2
6 NY 2013 5

Let's expose all the missing years between each state's minimum and maximum year, reusing the new_year_values function defined earlier:

result = df.complete({"year": new_year_values}, by="state", sort=True)

result
state year value
0 CA 2010 1.0
1 CA 2011 NaN
2 CA 2012 NaN
3 CA 2013 3.0
4 HI 2010 1.0
5 HI 2011 NaN
6 HI 2012 2.0
7 HI 2013 NaN
8 HI 2014 NaN
9 HI 2015 NaN
10 HI 2016 3.0
11 NY 2009 2.0
12 NY 2010 NaN
13 NY 2011 NaN
14 NY 2012 NaN
15 NY 2013 5.0
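
The per-group behaviour can be sketched in plain pandas by handling each state on its own and concatenating the pieces. This is a hedged illustration of the idea, assuming each state has at most one row per year. It is not how complete is implemented, and the names `pieces`, `piece`, and `completed` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],
        "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],
        "value": [1, 3, 1, 2, 3, 2, 5],
    }
)

pieces = []
for state, group in df.groupby("state"):
    # The year range is computed from this state's rows only.
    years = pd.RangeIndex(group["year"].min(), group["year"].max() + 1, name="year")
    # Assumption: years are unique within a state, so reindex is well defined.
    piece = group.set_index("year")[["value"]].reindex(years).reset_index()
    piece.insert(0, "state", state)
    pieces.append(piece)

completed = pd.concat(pieces, ignore_index=True)
print(completed)
```

Because the range is computed per state, CA spans 2010 to 2013 while HI spans 2010 to 2016, which is why the groups contribute different numbers of rows.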

You can fill the nulls with Pandas' fillna:

# fillna's downcast argument is deprecated in recent pandas, so cast explicitly
result.fillna(0).astype({"value": "int64"})
state year value
0 CA 2010 1
1 CA 2011 0
2 CA 2012 0
3 CA 2013 3
4 HI 2010 1
5 HI 2011 0
6 HI 2012 2
7 HI 2013 0
8 HI 2014 0
9 HI 2015 0
10 HI 2016 3
11 NY 2009 2
12 NY 2010 0
13 NY 2011 0
14 NY 2012 0
15 NY 2013 5
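
If replacing the nulls with 0 would blur the difference between "no observation" and "observed zero", one alternative is pandas' nullable integer dtype, which keeps whole numbers as integers and stores missing entries as `<NA>`. The sketch below uses a small hand-built stand-in for the CA rows of result (an assumption made for brevity), not the output of complete itself.

```python
import pandas as pd

# Hand-built stand-in for the CA rows of `result` above.
result = pd.DataFrame(
    {
        "state": ["CA", "CA", "CA", "CA"],
        "year": [2010, 2011, 2012, 2013],
        "value": [1.0, float("nan"), float("nan"), 3.0],
    }
)

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
as_nullable = result.astype({"value": "Int64"})
print(as_nullable)

# Filling later still works and the column stays an integer column.
filled = as_nullable.fillna({"value": 0})
print(filled["value"].tolist())  # [1, 0, 0, 3]
```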