Pivot Data from Long to Wide Form
import janitor as jn
import pandas as pd
df = [
{"name": "Alice", "variable": "wk1", "value": 5},
{"name": "Alice", "variable": "wk2", "value": 9},
{"name": "Alice", "variable": "wk3", "value": 20},
{"name": "Alice", "variable": "wk4", "value": 22},
{"name": "Bob", "variable": "wk1", "value": 7},
{"name": "Bob", "variable": "wk2", "value": 11},
{"name": "Bob", "variable": "wk3", "value": 17},
{"name": "Bob", "variable": "wk4", "value": 33},
{"name": "Carla", "variable": "wk1", "value": 6},
{"name": "Carla", "variable": "wk2", "value": 13},
{"name": "Carla", "variable": "wk3", "value": 39},
{"name": "Carla", "variable": "wk4", "value": 40},
]
df = pd.DataFrame(df)
df
Reshaping to wide form:
df.pivot_wider(index="name", names_from="variable", values_from="value")
Pivoting on multiple columns is possible :
df = [
{"name": 1, "n": 10.0, "pct": 0.1},
{"name": 2, "n": 20.0, "pct": 0.2},
{"name": 3, "n": 30.0, "pct": 0.3},
]
df = pd.DataFrame(df)
df
(
df.assign(num=0).pivot_wider(
index="num", names_from="name", values_from=["n", "pct"], names_sep="_"
)
)
You may choose not to flatten the columns, by setting flatten_levels
to False
:
df = [
{"dep": 5.5, "step": 1, "a": 20, "b": 30},
{"dep": 5.5, "step": 2, "a": 25, "b": 37},
{"dep": 6.1, "step": 1, "a": 22, "b": 19},
{"dep": 6.1, "step": 2, "a": 18, "b": 29},
]
df = pd.DataFrame(df)
df
df.pivot_wider(index="dep", names_from="step", flatten_levels=False)
The order of the levels can be changed with the levels_order
parameter, which internally uses pandas' reorder_levels:
df.pivot_wider(
index="dep", names_from="step", flatten_levels=False, levels_order=["step", None]
)
df.pivot_wider(
index="dep",
names_from="step",
flatten_levels=True,
)
df.pivot_wider(
index="dep", names_from="step", flatten_levels=True, levels_order=["step", None]
)
names_sep
and names_glue
come in handy in situations where names_from
and/or values_from
contain multiple variables; it is used primarily when the columns are flattened. The default value for names_sep
is _
:
# default value of names_sep is '_'
df.pivot_wider(index="dep", names_from="step")
df.pivot_wider(index="dep", names_from="step", names_sep="")
With names_glue
you can glue the individual levels (if MultiIndex) into one (similar to names_sep
), or you can modify the final columns, as long as it can be passed to pd.Index.map
:
# replicate `names_sep`
df.pivot_wider(index="dep", names_from="step", names_sep=None, names_glue="_".join)
# going beyond names_sep
df.pivot_wider(
index="dep",
names_from="step",
names_sep=None,
names_glue=lambda col: f"{col[0]}_step{col[1]}",
)
There are scenarios where the column order of the final dataframe is important:
df = [
{"Salesman": "Knut", "Height": 6, "product": "bat", "price": 5},
{"Salesman": "Knut", "Height": 6, "product": "ball", "price": 1},
{"Salesman": "Knut", "Height": 6, "product": "wand", "price": 3},
{"Salesman": "Steve", "Height": 5, "product": "pen", "price": 2},
]
df = pd.DataFrame(df)
df
idx = df.groupby(["Salesman", "Height"]).cumcount().add(1)
(df.assign(idx=idx).pivot_wider(index=["Salesman", "Height"], names_from="idx"))
To get the columns in a form where product
alternates with price
, we can combine pivot_wider
(or plain pd.pivot
) with pd.sort_index
and janitor.collapse_levels
:
(
df.assign(idx=idx)
.pivot_wider(index=["Salesman", "Height"], names_from="idx", flatten_levels=False)
.sort_index(level="idx", axis="columns", sort_remaining=False)
.collapse_levels()
.reset_index()
)
df = pd.DataFrame(
{
"geoid": [1, 1, 13, 13],
"name": ["Alabama", "Alabama", "Georgia", "Georgia"],
"variable": [
"pop_renter",
"median_rent",
"pop_renter",
"median_rent",
],
"estimate": [1434765, 747, 3592422, 927],
"error": [16736, 3, 33385, 3],
}
)
df
df.pivot_wider(
index=["geoid", "name"],
names_from="variable",
values_from=["estimate", "error"],
levels_order=["variable", None],
)
For the reshaping above, we would like to maintain the order in variable
, where pop_renter
comes before median_rent
; this can be achieved by converting the variable
column to a categorical, before reshaping:
(
df.encode_categorical(variable=(None, "appearance")).pivot_wider(
index=["geoid", "name"],
names_from="variable",
values_from=["estimate", "error"],
levels_order=["variable", None],
)
)