Skip to content

Binder

Groupby_agg : Shortcut for assigning a groupby-transform to a new column.

Background

This notebook serves to show how to use the groupby_agg method from pyjanitor's general functions submodule.

The groupby_agg method allows us to add the result of an aggregation from a grouping, as a new column, back to the dataframe.

Currently in pandas, to append a column back to a dataframe, you do it in three steps: 1. Groupby a column or columns 2. Apply the transform method with an aggregate function on the grouping, and finally 3. Assign the result of the transform to a new column in the dataframe.

The groupby_agg allows you to replicate the same process in one step and with sensible arguments. The example below illustrates this function

# load modules
import numpy as np
import pandas as pd
from janitor import groupby_agg
data = {
    "item": ["shoe", "shoe", "bag", "shoe", "bag"],
    "MRP": [220, 450, 320, 200, 305],
    "number_sold": [100, 40, 56, 38, 25],
}

df = pd.DataFrame(data)

df
item MRP number_sold
0 shoe 220 100
1 shoe 450 40
2 bag 320 56
3 shoe 200 38
4 bag 305 25

Use grouby_agg to find average price for each item and append column to dataframe

df = df.groupby_agg(
    by="item", agg="mean", agg_column_name="MRP", new_column_name="Avg_MRP"
)

df
item MRP number_sold Avg_MRP
0 shoe 220 100 290.0
1 shoe 450 40 290.0
2 bag 320 56 312.5
3 shoe 200 38 290.0
4 bag 305 25 312.5

Null cells are retained as well :

df = pd.DataFrame(
    {
        "name": ("black", "black", "black", "red", "red"),
        "type": ("chair", "chair", "sofa", "sofa", "plate"),
        "num": (4, 5, 12, 4, 3),
        "nulls": (1, 1, np.nan, np.nan, 3),
    }
)

df
name type num nulls
0 black chair 4 1.0
1 black chair 5 1.0
2 black sofa 12 NaN
3 red sofa 4 NaN
4 red plate 3 3.0
filtered_df = df.groupby_agg(
    by=["nulls"], agg="size", agg_column_name="type", new_column_name="counter"
)
filtered_df
name type num nulls counter
0 black chair 4 1.0 2.0
1 black chair 5 1.0 2.0
2 black sofa 12 NaN NaN
3 red sofa 4 NaN NaN
4 red plate 3 3.0 1.0

The groupby_agg method can be extended for different purposes. One of these is groupwise filtering, where only groups that meet a condition are retained. Let's explore this with an example:

filtered_df = df.groupby_agg(
    by=["name", "type"], agg="size", agg_column_name="type", new_column_name="counter"
).query("counter > 1")
filtered_df
name type num nulls counter
0 black chair 4 1.0 2
1 black chair 5 1.0 2