from random import choice
import janitor
import numpy as np
import pandas as pd
Transforming columns
Introduction
There are two ways to use the transform_column function:
by passing in a function that operates elementwise,
or by passing in a function that operates columnwise.
We will show you both in this notebook.
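To make the distinction concrete before timing anything, here is a minimal sketch in plain pandas (no pyjanitor involved). The tiny `small` DataFrame and the variable names are illustrative assumptions, not part of the original notebook. An elementwise function is called once per value, while a columnwise function is called once with the whole Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration.
small = pd.DataFrame({"x": [-1.5, 2.0, -3.25]})

# Elementwise: the function receives one scalar at a time.
elementwise_result = small["x"].apply(lambda value: abs(value))

# Columnwise: the function receives the entire Series in one call.
columnwise_result = np.abs(small["x"])

# Both routes give the same values; only the calling convention differs.
print(elementwise_result.equals(columnwise_result))
```

The results match, so the choice between the two modes is about speed and about what kind of function you have, not about the output.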
Numeric Data
data = np.random.normal(size=(1_000_000, 4))
df = pd.DataFrame(data).clean_names()
Using the elementwise application:
%%timeit
# We are using a lambda function that operates on each element,
# to highlight the point about elementwise operations.
df.transform_column("0", lambda x: np.abs(x), "abs_0")
And now using columnwise application:
%%timeit
df.transform_column("0", lambda s: np.abs(s), elementwise=False)
Because np.abs is vectorizable over the entire series,
it runs about 50X faster than the elementwise version.
If you know your function is vectorizable,
then take advantage of the fact,
and use it inside transform_column with elementwise=False.
After all, all that transform_column has done
is provide a method-chainable way of applying the function.
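As a rough illustration of what "method-chainable" means here, the sketch below builds a similar helper from plain pandas. The function `apply_to_column` and the `demo` DataFrame are hypothetical names invented for this example. This is not pyjanitor's implementation, only an assumption about the general shape such a helper could take.

```python
import numpy as np
import pandas as pd


def apply_to_column(df, column_name, function, dest_column_name=None, elementwise=True):
    """Hypothetical stand-in for a chainable column transform (not pyjanitor's code)."""
    # Overwrite the source column unless a destination name is given.
    target = dest_column_name if dest_column_name is not None else column_name
    column = df[column_name]
    # Elementwise: call the function per value. Columnwise: call it once on the Series.
    result = column.apply(function) if elementwise else function(column)
    # assign returns a new DataFrame, which is what keeps the chain going.
    return df.assign(**{target: result})


demo = pd.DataFrame({"a": [-1.0, 2.0, -3.0], "b": [1.0, -1.0, 1.0]})

chained = (
    demo
    .pipe(apply_to_column, "a", np.abs, elementwise=False)
    .pipe(apply_to_column, "b", lambda value: value * 10, dest_column_name="b_times_10")
)
print(chained)
```

Because every step returns a DataFrame, the transformations read top to bottom in a single expression, and the original `demo` is left untouched.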
String Data
Let's see it in action with string-type data.
def make_strings(length: int) -> str:
    """Return a random string of uppercase letters with the given length."""
    return "".join(choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(length))
strings = (make_strings(30) for _ in range(1_000_000))
stringdf = pd.DataFrame({"data": list(strings)})
Firstly, by raw function application:
def first_five(s):
    """Return the first five characters of every string in the Series s."""
    return s.str[0:5]
%%timeit
stringdf.assign(data=first_five(stringdf["data"]))
%%timeit
first_five(stringdf["data"])
%%timeit
stringdf["data"].str[0:5]
%%timeit
stringdf["data"].apply(lambda x: x[0:5])
It appears that assigning the result to a column comes with a bit of overhead.
Now, by using transform_column with default settings (elementwise application):
%%timeit
stringdf.transform_column("data", lambda x: x[0:5])
Now, by using transform_column while also leveraging string methods:
%%timeit
stringdf.transform_column("data", first_five, elementwise=False)
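Before comparing timings, it can be worth confirming that the two string approaches agree. The sketch below does this in plain pandas on a small hypothetical `sample` DataFrame (an assumption for illustration, not the million-row data above).

```python
import pandas as pd

# Hypothetical toy data; the last string is shorter than the others on purpose.
sample = pd.DataFrame({"data": ["ABCDEFGHIJ", "KLMNOPQRST", "UVWXYZ"]})

# Elementwise: slice each Python string individually.
elementwise_slices = sample["data"].apply(lambda x: x[0:5])

# Columnwise: slice the whole Series through the .str accessor.
columnwise_slices = sample["data"].str[0:5]

print(elementwise_slices.tolist() == columnwise_slices.tolist())
```

With identical results established, any difference seen in the timings reflects how the function is applied rather than what it computes.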