Multiple Conditions with case_when
import janitor
import pandas as pd
from janitor.functions import case_when
janitor.__version__
# https://stackoverflow.com/q/19913659/7175713
df = pd.DataFrame({"col1": list("ABBC"), "col2": list("ZZXY")})
df
- Single Condition:
df.case_when(
df.col1 == "Z", # condition
"green", # value if True
"red", # value if False
column_name="color",
)
- Multiple Conditions:
df.case_when(
df.col2.eq("Z") & df.col1.eq("A"),
"yellow", # first condition and value
df.col2.eq("Z") & df.col1.eq("B"),
"blue", # second condition and value
df.col1.eq("B"),
"purple", # third condition and value
"black", # default if no condition is True
column_name="color",
)
Anonymous functions (lambda) are supported as well:
# https://stackoverflow.com/q/43391591/7175713
raw_data = {"age1": [23, 45, 21], "age2": [10, 20, 50]}
df = pd.DataFrame(raw_data, columns=["age1", "age2"])
df
df.case_when(
lambda df: (df.age1 - df.age2) > 0, # condition
lambda df: df.age1 - df.age2, # value if True
lambda df: df.age2 - df.age1, # default if False
column_name="diff",
)
data types are preserved; under the hood it uses pd.Series.mask:
df = df.astype("Int64")
df.dtypes
result = df.case_when(
lambda df: (df.age1 - df.age2) > 0,
lambda df: df.age1 - df.age2,
lambda df: df.age2 - df.age1,
column_name="diff",
)
result
result.dtypes
The conditions can be a string, as long as they can be evaluated with pd.eval
on the DataFrame, and return a boolean array:
# https://stackoverflow.com/q/54653356/7175713
data = {
"name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
"age": [42, 52, 36, 24, 73],
"preTestScore": [4, 24, 31, 2, 3],
"postTestScore": [25, 94, 57, 62, 70],
}
df = pd.DataFrame(data, columns=["name", "age", "preTestScore", "postTestScore"])
df
df.case_when(
"age < 10",
"baby",
"10 <= age < 20",
"kid",
"20 <= age < 30",
"young",
"30 <= age < 50",
"mature",
"grandpa",
column_name="elderly",
)
When multiple conditions are satisfied, the first one is used:
df = range(3, 30, 3)
df = pd.DataFrame(df, columns=["odd"])
df
df.case_when(
df.odd % 9 == 0, "divisible by 9", "divisible by 3", column_name="div_by_3_or_9"
)
lines 2, 5 and 8 are divisible by 3; however, because the first condition tests if it is divisible by 9, that outcome is used instead.
If column_name
exists in the DataFrame, then that column's values will be replaced with the outcome of case_when
:
df.case_when(df.odd % 9 == 0, "divisible by 9", "divisible by 3", column_name="odd")