from random import choice

import janitor
import numpy as np
import pandas as pd

Transforming columns

Introduction

There are two ways to use the transform_column function: by passing in a function that operates elementwise (the default), or by passing in a function that operates columnwise (with elementwise=False).

We will show you both in this notebook.
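As a rough illustration of the distinction, the sketch below uses plain pandas rather than transform_column itself. It assumes that the elementwise mode corresponds to calling the function once per value (as Series.apply does) and that the columnwise mode corresponds to calling the function once on the whole Series. The small DataFrame is made up for the example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the column name "0" mirrors the notebook's data.
df = pd.DataFrame({"0": [-1.5, 2.0, -3.25]})

# Elementwise: the function receives one scalar at a time.
elementwise_result = df["0"].apply(lambda x: abs(x))

# Columnwise: the function receives the whole Series in a single call.
columnwise_result = np.abs(df["0"])

print(elementwise_result.tolist())
print(columnwise_result.tolist())
```

Both calls produce the same values; the difference lies only in how many times the function is invoked.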

Numeric Data

data = np.random.normal(size=(1_000_000, 4))
df = pd.DataFrame(data).clean_names()

Using the elementwise application:

%%timeit
# We are using a lambda function that operates on each element,
# to highlight the point about elementwise operations.
df.transform_column("0", lambda x: np.abs(x), "abs_0")
1.08 s ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And now using columnwise application:

%%timeit
df.transform_column("0", lambda s: np.abs(s), elementwise=False)
8.5 ms ± 70.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
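The %%timeit magic only works inside IPython or Jupyter. As a rough sketch, a similar comparison can be made in a plain Python script with the standard-library timeit module. This version uses plain pandas calls as stand-ins for the two modes (an assumption, not transform_column itself) and a smaller array so that it finishes quickly; absolute numbers will differ from the ones shown here.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the notebook's 1,000,000 rows so the script runs quickly.
series = pd.Series(np.random.normal(size=100_000))

# Elementwise stand-in: np.abs is called once per value.
elementwise_seconds = min(
    timeit.repeat(lambda: series.apply(lambda x: np.abs(x)), number=1, repeat=3)
)

# Columnwise stand-in: np.abs is called once on the whole Series.
columnwise_seconds = min(
    timeit.repeat(lambda: np.abs(series), number=1, repeat=3)
)

print(f"elementwise: {elementwise_seconds * 1e3:.2f} ms")
print(f"columnwise:  {columnwise_seconds * 1e3:.2f} ms")
```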

Because np.abs is vectorizable over the entire series, it runs more than 100X faster here (8.5 ms versus 1.08 s, roughly 127X). If you know your function is vectorizable, take advantage of that by passing elementwise=False to transform_column. After all, all that transform_column does is provide a method-chainable way of applying the function.
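One way to find out whether a function is vectorizable is to call it on a small Series and see whether it works and agrees with the elementwise result. The sketch below is only an illustration: the two clipping functions are hypothetical examples, not part of pyjanitor. The scalar version uses an `if`, which needs a single truth value and therefore fails on a Series.

```python
import pandas as pd


def clip_negative_scalar(x):
    # Scalar-only: `if` requires a single True/False, not a Series of them.
    return x if x > 0 else 0.0


def clip_negative_vectorized(s):
    # Works on the whole Series at once.
    return s.clip(lower=0.0)


sample = pd.Series([-1.0, 0.5, 2.0])

# Calling the scalar function on a Series raises ValueError
# (the truth value of a Series is ambiguous).
try:
    clip_negative_scalar(sample)
    scalar_fn_accepts_series = True
except ValueError:
    scalar_fn_accepts_series = False

# The vectorized rewrite should agree with the elementwise application.
results_match = sample.apply(clip_negative_scalar).equals(
    clip_negative_vectorized(sample)
)

print(scalar_fn_accepts_series, results_match)
```

A function that passes this kind of check on a representative sample is a reasonable candidate for elementwise=False.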

String Data

Let's see it in action with string-type data.

def make_strings(length: int):
    return "".join(choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(length))


strings = (make_strings(30) for _ in range(1_000_000))

stringdf = pd.DataFrame({"data": list(strings)})

Firstly, by raw function application:

def first_five(s):
    return s.str[0:5]

%%timeit
stringdf.assign(data=first_five(stringdf["data"]))
279 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
first_five(stringdf["data"])
194 ms ± 742 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
stringdf["data"].str[0:5]
194 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
stringdf["data"].apply(lambda x: x[0:5])
183 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Assigning the result to a column appears to come with some overhead: the assign call takes about 279 ms, compared with about 194 ms for the string operation on its own.
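One plausible contributor to this overhead, not profiled here, is that DataFrame.assign returns a new DataFrame instead of modifying the existing one. The sketch below uses a tiny made-up frame to show that behaviour.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for the example.
small_df = pd.DataFrame({"data": ["ABCDEFGH", "IJKLMNOP"]})

# assign builds and returns a new DataFrame with the column replaced.
result = small_df.assign(data=small_df["data"].str[0:5])

print(result["data"].tolist())    # truncated strings
print(small_df["data"].tolist())  # original strings, unchanged
print(result is small_df)         # a different object
```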

Now, by using transform_column with default settings:

%%timeit
stringdf.transform_column("data", lambda x: x[0:5])
211 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now by using transform_column while also leveraging string methods:

%%timeit
stringdf.transform_column("data", first_five, elementwise=False)
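279 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unlike the numeric case, the columnwise version is not faster here: it takes 279 ms, against 211 ms for the elementwise default. This suggests that pandas string methods do not offer the same speedup on this data as NumPy's vectorized numeric functions.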
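Timing aside, the two styles can behave differently when the column contains missing values. The sketch below uses plain pandas on a made-up Series: the .str accessor passes missing values through, while a bare slicing lambda raises TypeError unless it is guarded. The helper first_five_or_missing is a hypothetical name introduced for this example.

```python
import pandas as pd

# Made-up data with one missing value.
with_missing = pd.Series(["ABCDEFGH", None], dtype=object)

# Columnwise style: the .str accessor leaves the missing value missing.
columnwise = with_missing.str[0:5]

# Elementwise style: slicing None is not possible, so apply raises TypeError.
try:
    with_missing.apply(lambda x: x[0:5])
    elementwise_failed = False
except TypeError:
    elementwise_failed = True


def first_five_or_missing(x):
    # Hypothetical guard: only slice real strings, pass anything else through.
    return x[0:5] if isinstance(x, str) else x


elementwise_guarded = with_missing.apply(first_five_or_missing)

print(columnwise.tolist())
print(elementwise_failed)
print(elementwise_guarded.tolist())
```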