Groupby_agg : Shortcut for assigning a groupby-transform to a new column.

Background

This notebook serves to show how to use the groupby_agg method from pyjanitor's general functions submodule.

The groupby_agg method allows us to add the result of an aggregation from a grouping, as a new column, back to the dataframe.

Currently in pandas, to append a column back to a dataframe, you do it in three steps: 1. Groupby a column or columns 2. Apply the transform method with an aggregate function on the grouping, and finally 3. Assign the result of the transform to a new column in the dataframe.

The groupby_agg allows you to replicate the same process in one step and with sensible arguments. The example below illustrates this function

# load modules
import numpy as np
import pandas as pd
from janitor import groupby_agg

data = {
    "item": ["shoe", "shoe", "bag", "shoe", "bag"],
    "MRP": [220, 450, 320, 200, 305],
    "number_sold": [100, 40, 56, 38, 25],
}

df = pd.DataFrame(data)

df

	item	MRP	number_sold
0	shoe	220	100
1	shoe	450	40
2	bag	320	56
3	shoe	200	38
4	bag	305	25

Use grouby_agg to find average price for each item and append column to dataframe

df = df.groupby_agg(
    by="item", agg="mean", agg_column_name="MRP", new_column_name="Avg_MRP"
)

df

	item	MRP	number_sold	Avg_MRP
0	shoe	220	100	290.0
1	shoe	450	40	290.0
2	bag	320	56	312.5
3	shoe	200	38	290.0
4	bag	305	25	312.5

Null cells are retained as well :

df = pd.DataFrame(
    {
        "name": ("black", "black", "black", "red", "red"),
        "type": ("chair", "chair", "sofa", "sofa", "plate"),
        "num": (4, 5, 12, 4, 3),
        "nulls": (1, 1, np.nan, np.nan, 3),
    }
)

df

	name	type	num	nulls
0	black	chair	4	1.0
1	black	chair	5	1.0
2	black	sofa	12	NaN
3	red	sofa	4	NaN
4	red	plate	3	3.0

filtered_df = df.groupby_agg(
    by=["nulls"], agg="size", agg_column_name="type", new_column_name="counter"
)
filtered_df

	name	type	num	nulls	counter
0	black	chair	4	1.0	2.0
1	black	chair	5	1.0	2.0
2	black	sofa	12	NaN	NaN
3	red	sofa	4	NaN	NaN
4	red	plate	3	3.0	1.0

The groupby_agg method can be extended for different purposes. One of these is groupwise filtering, where only groups that meet a condition are retained. Let's explore this with an example:

filtered_df = df.groupby_agg(
    by=["name", "type"], agg="size", agg_column_name="type", new_column_name="counter"
).query("counter > 1")
filtered_df

	name	type	num	nulls	counter
0	black	chair	4	1.0	2
1	black	chair	5	1.0	2