Functions

General Functions

pyjanitor's general-purpose data cleaning functions.

NOTE: Instructions for future contributors:

  1. Place the source code of the functions in a file named after the function.
  2. Place utility functions in the same file.
  3. If you use a utility function from another source file, please refactor it out to janitor.functions.utils.
  4. Import the function into this file so that it shows up in the top-level API.
  5. Sort the imports in alphabetical order.
  6. Try to group related functions together (e.g. see convert_date.py)
  7. Never import utils.

DropLabel dataclass

Helper class for removing labels within the select syntax.

label can be any of the types supported in the select, select_rows and select_columns functions. An array of integers not matching the labels is returned.

New in version 0.24.0
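
For example, a minimal sketch (not part of the original docstring) of dropping a single column via select_columns; the DataFrame and its column names are illustrative:

import pandas as pd
import janitor  # registers select_columns and friends on DataFrames
from janitor.functions.utils import DropLabel

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
# Select every column except "a".
df.select_columns(DropLabel("a"))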

Parameters:

Name Type Description Default
label Any

Label(s) to be dropped from the index.

required
Source code in janitor/functions/utils.py
@dataclass
class DropLabel:
    """Helper class for removing labels within the `select` syntax.

    `label` can be any of the types supported in the `select`,
    `select_rows` and `select_columns` functions.
    An array of integers not matching the labels is returned.

    !!! info "New in version 0.24.0"

    Args:
        label: Label(s) to be dropped from the index.
    """

    label: Any

col

Helper class for column selection within an expression.

Parameters:

Name Type Description Default
column Hashable

The name of the column to be selected.

required

Raises:

Type Description
TypeError

If the column parameter is not hashable.

New in version 0.25.0

Warning

col is currently considered experimental. The implementation and parts of the API may change without warning.
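
As a minimal, hedged sketch (not from the original docs): comparing two col objects records the condition on the left-hand object, which can then be passed as a condition to functions such as conditional_join or get_join_indices. The column names here are illustrative:

from janitor.functions.utils import col

# Per the source below, the comparison stores
# (left_label, right_label, operator) on the left operand's
# join_args attribute and returns that same object.
condition = col("start") <= col("value")
print(condition.join_args)  # ('start', 'value', '<=')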

Source code in janitor/functions/utils.py
class col:
    """Helper class for column selection within an expression.

    Args:
        column (Hashable): The name of the column to be selected.

    Raises:
        TypeError: If the `column` parameter is not hashable.

    !!! info "New in version 0.25.0"

    !!! warning

        `col` is currently considered experimental.
        The implementation and parts of the API
        may change without warning.

    """

    def __init__(self, column: Hashable):
        self.cols = column
        check("column", self.cols, [Hashable])
        self.join_args = None

    def __gt__(self, other):
        """Implements the greater-than comparison operator (`>`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, ">")
        return self

    def __ge__(self, other):
        """Implements the greater-than-or-equal-to comparison operator (`>=`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, ">=")
        return self

    def __lt__(self, other):
        """Implements the less-than comparison operator (`<`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "<")
        return self

    def __le__(self, other):
        """Implements the less-than-or-equal-to comparison operator (`<=`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "<=")
        return self

    def __ne__(self, other):
        """Implements the not-equal-to comparison operator (`!=`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "!=")
        return self

    def __eq__(self, other):
        """Implements the equal-to comparison operator (`==`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "==")
        return self

__eq__(other)

Implements the equal-to comparison operator (==).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
def __eq__(self, other):
    """Implements the equal-to comparison operator (`==`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "==")
    return self

__ge__(other)

Implements the greater-than-or-equal-to comparison operator (>=).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
def __ge__(self, other):
    """Implements the greater-than-or-equal-to comparison operator (`>=`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, ">=")
    return self

__gt__(other)

Implements the greater-than comparison operator (>).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
def __gt__(self, other):
    """Implements the greater-than comparison operator (`>`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, ">")
    return self

__le__(other)

Implements the less-than-or-equal-to comparison operator (<=).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
def __le__(self, other):
    """Implements the less-than-or-equal-to comparison operator (`<=`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "<=")
    return self

__lt__(other)

Implements the less-than comparison operator (<).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
def __lt__(self, other):
    """Implements the less-than comparison operator (`<`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "<")
    return self

__ne__(other)

Implements the not-equal-to comparison operator (!=).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
def __ne__(self, other):
    """Implements the not-equal-to comparison operator (`!=`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "!=")
    return self

convert_excel_date(df, column_name)

Convert Excel's serial date format into Python datetime format.

This method mutates the original DataFrame.

Implementation is also from Stack Overflow (https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas).

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [39690, 39690, 37118]})
>>> df
    date
0  39690
1  39690
2  37118
>>> df.convert_excel_date('date')
        date
0 2008-08-30
1 2008-08-30
2 2001-08-15

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Raises:

Type Description
ValueError

If there are non-numeric values in the column.

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def convert_excel_date(
    df: pd.DataFrame, column_name: Hashable
) -> pd.DataFrame:
    """Convert Excel's serial date format into Python datetime format.

    This method mutates the original DataFrame.

    Implementation is also from
    [Stack Overflow](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas).

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"date": [39690, 39690, 37118]})
        >>> df
            date
        0  39690
        1  39690
        2  37118
        >>> df.convert_excel_date('date')
                date
        0 2008-08-30
        1 2008-08-30
        2 2001-08-15

    Args:
        df: A pandas DataFrame.
        column_name: A column name.

    Raises:
        ValueError: If there are non numeric values in the column.

    Returns:
        A pandas DataFrame with corrected dates.
    """  # noqa: E501

    if not is_numeric_dtype(df[column_name]):
        raise ValueError(
            "There are non-numeric values in the column. "
            "All values must be numeric."
        )

    df[column_name] = pd.TimedeltaIndex(
        df[column_name], unit="d"
    ) + dt.datetime(
        1899, 12, 30
    )  # noqa: W503
    return df

convert_matlab_date(df, column_name)

Convert Matlab's serial date number into Python datetime format.

Implementation is also from Stack Overflow (https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python).

This method mutates the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
>>> df
            date
0  737125.000000
1  737124.815863
2  737124.498500
3  737124.000000
>>> df.convert_matlab_date('date')
                        date
0 2018-03-06 00:00:00.000000
1 2018-03-05 19:34:50.563200
2 2018-03-05 11:57:50.399999
3 2018-03-05 00:00:00.000000

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def convert_matlab_date(
    df: pd.DataFrame, column_name: Hashable
) -> pd.DataFrame:
    """Convert Matlab's serial date number into Python datetime format.

    Implementation is also from
    [Stack Overflow](https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python).

    This method mutates the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
        >>> df
                    date
        0  737125.000000
        1  737124.815863
        2  737124.498500
        3  737124.000000
        >>> df.convert_matlab_date('date')
                                date
        0 2018-03-06 00:00:00.000000
        1 2018-03-05 19:34:50.563200
        2 2018-03-05 11:57:50.399999
        3 2018-03-05 00:00:00.000000

    Args:
        df: A pandas DataFrame.
        column_name: A column name.

    Returns:
        A pandas DataFrame with corrected dates.
    """  # noqa: E501
    days = pd.Series([dt.timedelta(v % 1) for v in df[column_name]])
    df[column_name] = (
        df[column_name].astype(int).apply(dt.datetime.fromordinal)
        + days
        - dt.timedelta(days=366)
    )
    return df

convert_unix_date(df, column_name)

Convert unix epoch time into Python datetime format.

Note that this ignores the local time zone and converts all timestamps to naive datetimes based on UTC!

This method mutates the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
>>> df
         date
0  1651510462
1    53394822
2  1126233195
>>> df.convert_unix_date('date')
                 date
0 2022-05-02 16:54:22
1 1971-09-10 23:53:42
2 2005-09-09 02:33:15

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def convert_unix_date(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:
    """Convert unix epoch time into Python datetime format.

    Note that this ignores local tz and convert all timestamps to naive
    datetime based on UTC!

    This method mutates the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
        >>> df
                 date
        0  1651510462
        1    53394822
        2  1126233195
        >>> df.convert_unix_date('date')
                         date
        0 2022-05-02 16:54:22
        1 1971-09-10 23:53:42
        2 2005-09-09 02:33:15

    Args:
        df: A pandas DataFrame.
        column_name: A column name.

    Returns:
        A pandas DataFrame with corrected dates.
    """

    try:
        df[column_name] = pd.to_datetime(df[column_name], unit="s")
    except OutOfBoundsDatetime:  # Indicates time is in milliseconds.
        df[column_name] = pd.to_datetime(df[column_name], unit="ms")
    return df

fill_direction(df, **kwargs)

Provide a method-chainable function for filling missing values in selected columns.

It is a wrapper for pd.Series.ffill and pd.Series.bfill, and pairs the column name with one of up, down, updown, and downup.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

Examples:

>>> import pandas as pd
>>> import janitor as jn
>>> df = pd.DataFrame(
...    {
...        'col1': [1, 2, 3, 4],
...        'col2': [None, 5, 6, 7],
...        'col3': [8, 9, 10, None],
...        'col4': [None, None, 11, None],
...        'col5': [None, 12, 13, None]
...    }
... )
>>> df
   col1  col2  col3  col4  col5
0     1   NaN   8.0   NaN   NaN
1     2   5.0   9.0   NaN  12.0
2     3   6.0  10.0  11.0  13.0
3     4   7.0   NaN   NaN   NaN
>>> df.fill_direction(
... col2 = 'up',
... col3 = 'down',
... col4 = 'downup',
... col5 = 'updown'
... )
   col1  col2  col3  col4  col5
0     1   5.0   8.0  11.0  12.0
1     2   5.0   9.0  11.0  12.0
2     3   6.0  10.0  11.0  13.0
3     4   7.0  10.0  11.0  13.0

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
**kwargs Any

Key-value pairs of columns and directions. Directions can be one of down, up, updown (fill up then down), or downup (fill down then up).

{}

Raises:

Type Description
ValueError

If direction supplied is not one of down, up, updown, or downup.

Returns:

Type Description
DataFrame

A pandas DataFrame with modified column(s).

Source code in janitor/functions/fill.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.assign` instead."
    )
)
def fill_direction(df: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
    """Provide a method-chainable function for filling missing values
    in selected columns.

    It is a wrapper for `pd.Series.ffill` and `pd.Series.bfill`,
    and pairs the column name with one of `up`, `down`, `updown`,
    and `downup`.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.assign` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor as jn
        >>> df = pd.DataFrame(
        ...    {
        ...        'col1': [1, 2, 3, 4],
        ...        'col2': [None, 5, 6, 7],
        ...        'col3': [8, 9, 10, None],
        ...        'col4': [None, None, 11, None],
        ...        'col5': [None, 12, 13, None]
        ...    }
        ... )
        >>> df
           col1  col2  col3  col4  col5
        0     1   NaN   8.0   NaN   NaN
        1     2   5.0   9.0   NaN  12.0
        2     3   6.0  10.0  11.0  13.0
        3     4   7.0   NaN   NaN   NaN
        >>> df.fill_direction(
        ... col2 = 'up',
        ... col3 = 'down',
        ... col4 = 'downup',
        ... col5 = 'updown'
        ... )
           col1  col2  col3  col4  col5
        0     1   5.0   8.0  11.0  12.0
        1     2   5.0   9.0  11.0  12.0
        2     3   6.0  10.0  11.0  13.0
        3     4   7.0  10.0  11.0  13.0

    Args:
        df: A pandas DataFrame.
        **kwargs: Key - value pairs of columns and directions.
            Directions can be either `down`, `up`, `updown`
            (fill up then down) and `downup` (fill down then up).

    Raises:
        ValueError: If direction supplied is not one of `down`, `up`,
            `updown`, or `downup`.

    Returns:
        A pandas DataFrame with modified column(s).
    """  # noqa: E501

    if not kwargs:
        return df

    fill_types = {fill.name for fill in _FILLTYPE}
    for column_name, fill_type in kwargs.items():
        check("column_name", column_name, [str])
        check("fill_type", fill_type, [str])
        if fill_type.upper() not in fill_types:
            raise ValueError(
                "fill_type should be one of up, down, updown, or downup."
            )

    check_column(df, kwargs)

    new_values = {}
    for column_name, fill_type in kwargs.items():
        direction = _FILLTYPE[f"{fill_type.upper()}"].value
        if len(direction) == 1:
            direction = methodcaller(direction[0])
            output = direction(df[column_name])
        else:
            direction = [methodcaller(entry) for entry in direction]
            output = _chain_func(df[column_name], *direction)
        new_values[column_name] = output

    return df.assign(**new_values)

fill_empty(df, column_names, value)

Fill NaN values in specified columns with a given value.

Super sugary syntax that wraps pandas.DataFrame.fillna.

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use jn.impute instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...        {
...            'col1': [1, 2, 3],
...            'col2': [None, 4, None ],
...            'col3': [None, 5, 6]
...        }
...    )
>>> df
   col1  col2  col3
0     1   NaN   NaN
1     2   4.0   5.0
2     3   NaN   6.0
>>> df.fill_empty(column_names = 'col2', value = 0)
   col1  col2  col3
0     1   0.0   NaN
1     2   4.0   5.0
2     3   0.0   6.0
>>> df.fill_empty(column_names = ['col2', 'col3'], value = 0)
   col1  col2  col3
0     1   0.0   0.0
1     2   4.0   5.0
2     3   0.0   6.0

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names. If a single column name is passed in, then only that column will be filled; if a list or tuple is passed in, then those columns will all be filled with the same value.

required
value Any

The value that replaces the NaN values.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with NaN values filled.

Source code in janitor/functions/fill.py
@pf.register_dataframe_method
@refactored_function(
    message="This function will be deprecated in a 1.x release. "
    "Kindly use `jn.impute` instead."
)
@deprecated_alias(columns="column_names")
def fill_empty(
    df: pd.DataFrame,
    column_names: Union[str, Iterable[str], Hashable],
    value: Any,
) -> pd.DataFrame:
    """Fill `NaN` values in specified columns with a given value.

    Super sugary syntax that wraps `pandas.DataFrame.fillna`.

    This method mutates the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use [`jn.impute`][janitor.functions.impute.impute] instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...        {
        ...            'col1': [1, 2, 3],
        ...            'col2': [None, 4, None ],
        ...            'col3': [None, 5, 6]
        ...        }
        ...    )
        >>> df
           col1  col2  col3
        0     1   NaN   NaN
        1     2   4.0   5.0
        2     3   NaN   6.0
        >>> df.fill_empty(column_names = 'col2', value = 0)
           col1  col2  col3
        0     1   0.0   NaN
        1     2   4.0   5.0
        2     3   0.0   6.0
        >>> df.fill_empty(column_names = ['col2', 'col3'], value = 0)
           col1  col2  col3
        0     1   0.0   0.0
        1     2   4.0   5.0
        2     3   0.0   6.0

    Args:
        df: A pandas DataFrame.
        column_names: A column name or an iterable (list
            or tuple) of column names. If a single column name is passed in,
            then only that column will be filled; if a list or tuple is passed
            in, then those columns will all be filled with the same value.
        value: The value that replaces the `NaN` values.

    Returns:
        A pandas DataFrame with `NaN` values filled.
    """

    check_column(df, column_names)
    return _fill_empty(df, column_names, value=value)

filter_column_isin(df, column_name, iterable, complement=False)

Filter a dataframe for values in a column that exist in the given iterable.

This method does not mutate the original DataFrame.

Assumes exact matching; fuzzy matching not implemented.

Examples:

Filter the dataframe to retain rows for which names are exactly James or John.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})
>>> df
    names foo
0    Jane   x
1  Jeremy   y
2    John   z
>>> df.filter_column_isin(column_name="names", iterable=["James", "John"])
  names foo
2  John   z

This is the method-chaining alternative to:

df = df[df["names"].isin(["James", "John"])]

If complement=True, then we will only get rows for which the names are neither James nor John.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column on which to filter.

required
iterable Iterable

An iterable. Could be a list, a tuple, or another pandas Series.

required
complement bool

Whether to return the complement of the selection or not.

False

Raises:

Type Description
ValueError

If iterable does not have a length of 1 or greater.

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def filter_column_isin(
    df: pd.DataFrame,
    column_name: Hashable,
    iterable: Iterable,
    complement: bool = False,
) -> pd.DataFrame:
    """Filter a dataframe for values in a column that exist in the given iterable.

    This method does not mutate the original DataFrame.

    Assumes exact matching; fuzzy matching not implemented.

    Examples:
        Filter the dataframe to retain rows for which `names`
        are exactly `James` or `John`.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})
        >>> df
            names foo
        0    Jane   x
        1  Jeremy   y
        2    John   z
        >>> df.filter_column_isin(column_name="names", iterable=["James", "John"])
          names foo
        2  John   z

        This is the method-chaining alternative to:

        ```python
        df = df[df["names"].isin(["James", "John"])]
        ```

        If `complement=True`, then we will only get rows for which the names
        are neither `James` nor `John`.

    Args:
        df: A pandas DataFrame.
        column_name: The column on which to filter.
        iterable: An iterable. Could be a list, tuple, another pandas
            Series.
        complement: Whether to return the complement of the selection or
            not.

    Raises:
        ValueError: If `iterable` does not have a length of `1`
            or greater.

    Returns:
        A filtered pandas DataFrame.
    """  # noqa: E501
    if len(iterable) == 0:
        raise ValueError(
            "`iterable` kwarg must be given an iterable of length 1 "
            "or greater."
        )
    criteria = df[column_name].isin(iterable)

    if complement:
        return df[~criteria]
    return df[criteria]

filter_date(df, column_name, start_date=None, end_date=None, years=None, months=None, days=None, column_date_options=None, format=None)

Filter a date-based column based on certain criteria.

This method does not mutate the original DataFrame.

Dates may be finicky and this function builds on top of the magic from the pandas to_datetime function that is able to parse dates well.

Additional options to parse the date type of your column may be found at the official pandas documentation.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": range(5, 9),
...     "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
... })
>>> df
   a          dt
0  5  2021-11-12
1  6  2021-12-15
2  7  2022-01-03
3  8  2022-01-09
>>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")
   a         dt
1  6 2021-12-15
2  7 2022-01-03
>>> df.filter_date("dt", years=[2021], months=[12])
   a         dt
1  6 2021-12-15

Note

This method will cast your column to a Timestamp!

Note

The format parameter only affects the parsing of the start_date and end_date parameters. If there's an issue with the format of the DataFrame column being parsed, pass {'format': your_format} to column_date_options.

Parameters:

Name Type Description Default
df DataFrame

The dataframe to filter on.

required
column_name Hashable

The date-based column on which to filter.

required
start_date Optional[date]

The beginning date to use to filter the DataFrame.

None
end_date Optional[date]

The end date to use to filter the DataFrame.

None
years Optional[List]

The years to use to filter the DataFrame.

None
months Optional[List]

The months to use to filter the DataFrame.

None
days Optional[List]

The days to use to filter the DataFrame.

None
column_date_options Optional[Dict]

Special options to use when parsing the date column in the original DataFrame. The options may be found at the official Pandas documentation.

None
format Optional[str]

If you're using a format for start_date or end_date that is not recognized natively by pandas' to_datetime function, you may supply the format yourself. Python date and time formats may be found at http://strftime.org/.

None

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name", start="start_date", end="end_date")
def filter_date(
    df: pd.DataFrame,
    column_name: Hashable,
    start_date: Optional[dt.date] = None,
    end_date: Optional[dt.date] = None,
    years: Optional[List] = None,
    months: Optional[List] = None,
    days: Optional[List] = None,
    column_date_options: Optional[Dict] = None,
    format: Optional[str] = None,  # skipcq: PYL-W0622
) -> pd.DataFrame:
    """Filter a date-based column based on certain criteria.

    This method does not mutate the original DataFrame.

    Dates may be finicky and this function builds on top of the *magic* from
    the pandas `to_datetime` function that is able to parse dates well.

    Additional options to parse the date type of your column may be found at
    the official pandas [documentation][datetime].

    [datetime]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": range(5, 9),
        ...     "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
        ... })
        >>> df
           a          dt
        0  5  2021-11-12
        1  6  2021-12-15
        2  7  2022-01-03
        3  8  2022-01-09
        >>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")
           a         dt
        1  6 2021-12-15
        2  7 2022-01-03
        >>> df.filter_date("dt", years=[2021], months=[12])
           a         dt
        1  6 2021-12-15

    !!!note

        This method will cast your column to a Timestamp!

    !!!note

        This only affects the format of the `start_date` and `end_date`
        parameters. If there's an issue with the format of the DataFrame being
        parsed, you would pass `{'format': your_format}` to `column_date_options`.

    Args:
        df: The dataframe to filter on.
        column_name: The column which to apply the fraction transformation.
        start_date: The beginning date to use to filter the DataFrame.
        end_date: The end date to use to filter the DataFrame.
        years: The years to use to filter the DataFrame.
        months: The months to use to filter the DataFrame.
        days: The days to use to filter the DataFrame.
        column_date_options: Special options to use when parsing the date
            column in the original DataFrame. The options may be found at the
            official Pandas documentation.
        format: If you're using a format for `start_date` or `end_date`
            that is not recognized natively by pandas' `to_datetime` function, you
            may supply the format yourself. Python date and time formats may be
            found [here](http://strftime.org/).

    Returns:
        A filtered pandas DataFrame.
    """  # noqa: E501

    def _date_filter_conditions(conditions):
        """Taken from: https://stackoverflow.com/a/13616382."""
        return reduce(np.logical_and, conditions)

    if column_date_options is None:
        column_date_options = {}
    df[column_name] = pd.to_datetime(df[column_name], **column_date_options)

    _filter_list = []

    if start_date:
        start_date = pd.to_datetime(start_date, format=format)
        _filter_list.append(df[column_name] >= start_date)

    if end_date:
        end_date = pd.to_datetime(end_date, format=format)
        _filter_list.append(df[column_name] <= end_date)

    if years:
        _filter_list.append(df[column_name].dt.year.isin(years))

    if months:
        _filter_list.append(df[column_name].dt.month.isin(months))

    if days:
        _filter_list.append(df[column_name].dt.day.isin(days))

    if start_date and end_date and start_date > end_date:
        warnings.warn(
            f"Your start date of {start_date} is after your end date of "
            f"{end_date}. Is this intended?"
        )

    return df.loc[_date_filter_conditions(_filter_list), :]

filter_on(df, criteria, complement=False)

Return a dataframe filtered on a particular criteria.

This method does not mutate the original DataFrame.

This is super-sugary syntax that wraps the pandas .query() API, enabling users to use strings to quickly specify filters for filtering their dataframe. The intent is that filter_on as a verb better matches the intent of a pandas user than the verb query.

This is intended to be the method-chaining equivalent of the following:

df = df[df["score"] < 3]

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.query instead.

Examples:

Filter students who failed an exam (scored less than 50).

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "student_id": ["S1", "S2", "S3"],
...     "score": [40, 60, 85],
... })
>>> df
  student_id  score
0         S1     40
1         S2     60
2         S3     85
>>> df.filter_on("score < 50", complement=False)
  student_id  score
0         S1     40

Credit to Brant Peterson for the name.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
criteria str

A filtering criterion that returns an array or Series of booleans, on which pandas can filter.

required
complement bool

Whether to return the complement of the filter or not. If set to True, then the rows for which the criteria is False are retained instead.

False

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.query` instead."
    )
)
def filter_on(
    df: pd.DataFrame,
    criteria: str,
    complement: bool = False,
) -> pd.DataFrame:
    """Return a dataframe filtered on a particular criteria.

    This method does not mutate the original DataFrame.

    This is super-sugary syntax that wraps the pandas `.query()` API, enabling
    users to use strings to quickly specify filters for filtering their
    dataframe. The intent is that `filter_on` as a verb better matches the
    intent of a pandas user than the verb `query`.

    This is intended to be the method-chaining equivalent of the following:

    ```python
    df = df[df["score"] < 3]
    ```

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.query` instead.


    Examples:
        Filter students who failed an exam (scored less than 50).

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "student_id": ["S1", "S2", "S3"],
        ...     "score": [40, 60, 85],
        ... })
        >>> df
          student_id  score
        0         S1     40
        1         S2     60
        2         S3     85
        >>> df.filter_on("score < 50", complement=False)
          student_id  score
        0         S1     40

    Credit to Brant Peterson for the name.

    Args:
        df: A pandas DataFrame.
        criteria: A filtering criteria that returns an array or Series of
            booleans, on which pandas can filter on.
        complement: Whether to return the complement of the filter or not.
            If set to True, then the rows for which the criteria is False are
            retained instead.

    Returns:
        A filtered pandas DataFrame.
    """

    warnings.warn(
        "This function will be deprecated in a 1.x release. "
        "Kindly use `pd.DataFrame.query` instead.",
        DeprecationWarning,
        stacklevel=find_stack_level(),
    )

    if complement:
        return df.query(f"not ({criteria})")
    return df.query(criteria)

filter_string(df, column_name, search_string, complement=False, case=True, flags=0, na=None, regex=True)

Filter a string-based column according to whether it contains a substring.

This is super sugary syntax that builds on top of pandas.Series.str.contains. It is meant to be the method-chaining equivalent of the following:

df = df[df[column_name].str.contains(search_string)]

This method does not mutate the original DataFrame.

Examples:

Retain rows whose column values contain a particular substring.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
>>> df
   a     b
0  3  bear
1  4  peeL
2  5  sail
>>> df.filter_string(column_name="b", search_string="ee")
   a     b
1  4  peeL
>>> df.filter_string(column_name="b", search_string="L", case=False)
   a     b
1  4  peeL
2  5  sail

Filter names that do not contain '.' (with regex mode disabled).

>>> import pandas as pd
>>> import janitor
>>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()
>>> df
          Name
0     JoseChen
1  Brian.Salvi
>>> df.filter_string(column_name="Name", search_string=".", regex=False, complement=True)
       Name
0  JoseChen

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to filter. The column should contain strings.

required
search_string str

A regex pattern or a (sub-)string to search.

required
complement bool

Whether to return the complement of the filter or not. If set to True, then the rows for which the string search fails are retained instead.

False
case bool

If True, case sensitive.

True
flags int

Flags to pass through to the re module, e.g. re.IGNORECASE.

0
na Any

Fill value for missing values. The default depends on the dtype of the array. For object-dtype, numpy.nan is used. For StringDtype, pandas.NA is used.

None
regex bool

If True, assumes search_string is a regular expression. If False, treats the search_string as a literal string.

True

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def filter_string(
    df: pd.DataFrame,
    column_name: Hashable,
    search_string: str,
    complement: bool = False,
    case: bool = True,
    flags: int = 0,
    na: Any = None,
    regex: bool = True,
) -> pd.DataFrame:
    """Filter a string-based column according to whether it contains a substring.

    This is super sugary syntax that builds on top of `pandas.Series.str.contains`.
    It is meant to be the method-chaining equivalent of the following:

    ```python
    df = df[df[column_name].str.contains(search_string)]
    ```

    This method does not mutate the original DataFrame.

    Examples:
        Retain rows whose column values contain a particular substring.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
        >>> df
           a     b
        0  3  bear
        1  4  peeL
        2  5  sail
        >>> df.filter_string(column_name="b", search_string="ee")
           a     b
        1  4  peeL
        >>> df.filter_string(column_name="b", search_string="L", case=False)
           a     b
        1  4  peeL
        2  5  sail

        Filter names does not contain `'.'` (disable regex mode).

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()
        >>> df
                  Name
        0     JoseChen
        1  Brian.Salvi
        >>> df.filter_string(column_name="Name", search_string=".", regex=False, complement=True)
               Name
        0  JoseChen

    Args:
        df: A pandas DataFrame.
        column_name: The column to filter. The column should contain strings.
        search_string: A regex pattern or a (sub-)string to search.
        complement: Whether to return the complement of the filter or not. If
            set to True, then the rows for which the string search fails are retained
            instead.
        case: If True, case sensitive.
        flags: Flags to pass through to the re module, e.g. re.IGNORECASE.
        na: Fill value for missing values. The default depends on dtype of
            the array. For object-dtype, `numpy.nan` is used. For `StringDtype`,
            `pandas.NA` is used.
        regex: If True, assumes `search_string` is a regular expression. If False,
            treats the `search_string` as a literal string.

    Returns:
        A filtered pandas DataFrame.
    """  # noqa: E501

    criteria = df[column_name].str.contains(
        pat=search_string,
        case=case,
        flags=flags,
        na=na,
        regex=regex,
    )

    if complement:
        return df[~criteria]

    return df[criteria]

get_columns(group, label)

Helper function for selecting columns on a grouped object, using the select syntax.

New in version 0.25.0
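
A minimal sketch (not part of the rendered docstring) of selecting a column from a GroupBy object before aggregating; the data and column names are illustrative:

import pandas as pd
from janitor.functions.utils import get_columns

df = pd.DataFrame({"group": ["x", "x", "y"], "b": [1, 2, 3], "c": [4, 5, 6]})
grouped = df.groupby("group")
# Keep only column "b" from the grouped object, then aggregate.
get_columns(grouped, "b").sum()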

Parameters:

Name Type Description Default
group Union[DataFrameGroupBy, SeriesGroupBy]

A Pandas GroupBy object.

required
label

column(s) to select.

required

Returns:

Type Description

A pandas groupby object.

Source code in janitor/functions/utils.py
def get_columns(group: Union[DataFrameGroupBy, SeriesGroupBy], label):
    """
    Helper function for selecting columns on a grouped object,
    using the
    [`select`][janitor.functions.select.select] syntax.

    !!! info "New in version 0.25.0"

    Args:
        group: A Pandas GroupBy object.
        label: column(s) to select.

    Returns:
        A pandas groupby object.
    """
    check("groupby object", group, [DataFrameGroupBy, SeriesGroupBy])
    label = get_index_labels(label, group.obj, axis="columns")
    label = label if is_scalar(label) else list(label)
    return group[label]

get_index_labels(arg, df, axis)

Convenience function to get the actual labels from the column/index.

New in version 0.25.0
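
A minimal sketch (not part of the rendered docstring) of resolving a shell-style glob into the matching column labels; the column names are illustrative:

import pandas as pd
from janitor.functions.utils import get_index_labels

df = pd.DataFrame({"x_mean": [1], "x_std": [2], "y_mean": [3]})
# Resolve the glob against the column axis.
get_index_labels("*_mean", df, axis="columns")
# -> Index(['x_mean', 'y_mean'], dtype='object')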

Parameters:

Name Type Description Default
arg

Valid inputs include: an exact column name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

required
df DataFrame

The pandas DataFrame object.

required
axis Literal['index', 'columns']

Should be either index or columns.

required

Returns:

Type Description
Index

A pandas Index.

Source code in janitor/functions/utils.py
def get_index_labels(
    arg, df: pd.DataFrame, axis: Literal["index", "columns"]
) -> pd.Index:
    """Convenience function to get actual labels from column/index

    !!! info "New in version 0.25.0"

    Args:
        arg: Valid inputs include: an exact column name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        df: The pandas DataFrame object.
        axis: Should be either `index` or `columns`.

    Returns:
        A pandas Index.
    """
    assert axis in {"index", "columns"}
    index = getattr(df, axis)
    return index[_select_index(arg, df, axis)]

get_join_indices(df, right, conditions, keep='all', use_numba=False, force=False)

Convenience function to return the matching indices from an inner join.

New in version 0.27.0
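
A minimal sketch (not part of the rendered docstring) of a non-equi "between" join expressed as two conditions; the frames and column names are illustrative:

import pandas as pd
from janitor.functions.conditional_join import get_join_indices

df = pd.DataFrame({"value": [2, 5, 7]})
right = pd.DataFrame({"start": [0, 6], "end": [3, 9]})
# Positions of the rows in df and right where start <= value <= end.
left_idx, right_idx = get_join_indices(
    df,
    right,
    [("value", "start", ">="), ("value", "end", "<=")],
)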

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
right Union[DataFrame, Series]

Named Series or DataFrame to join to.

required
conditions list[tuple[str]]

A list of conditions, each a tuple of the form (left_on, right_on, op), where left_on is the column label from df, right_on is the column label from right, and op is the operator. The col class is also supported. The operator can be any of ==, !=, <=, <, >=, >. For multiple conditions, the and (&) operator is used to combine the results of the individual conditions.

required
use_numba bool

Use numba, if installed, to accelerate the computation.

False
keep Literal['first', 'last', 'all']

Choose whether to return the first match, last match or all matches.

'all'
force bool

If True, force the non-equi join conditions to execute before the equi join.

False

Returns:

Type Description
tuple[ndarray, ndarray]

A tuple of indices for the rows in the dataframes that match.

Source code in janitor/functions/conditional_join.py
def get_join_indices(
    df: pd.DataFrame,
    right: Union[pd.DataFrame, pd.Series],
    conditions: list[tuple[str]],
    keep: Literal["first", "last", "all"] = "all",
    use_numba: bool = False,
    force: bool = False,
) -> tuple[np.ndarray, np.ndarray]:
    """Convenience function to return the matching indices from an inner join.

    !!! info "New in version 0.27.0"

    Args:
        df: A pandas DataFrame.
        right: Named Series or DataFrame to join to.
        conditions: List of arguments of tuple(s) of the form
            `(left_on, right_on, op)`, where `left_on` is the column
            label from `df`, `right_on` is the column label from `right`,
            while `op` is the operator.
            The `col` class is also supported. The operator can be any of
            `==`, `!=`, `<=`, `<`, `>=`, `>`. For multiple conditions,
            the and(`&`) operator is used to combine the results
            of the individual conditions.
        use_numba: Use numba, if installed, to accelerate the computation.
        keep: Choose whether to return the first match, last match or all matches.
        force: If `True`, force the non-equi join conditions
            to execute before the equi join.

    Returns:
        A tuple of indices for the rows in the dataframes that match.
    """
    return _conditional_join_compute(
        df=df,
        right=right,
        conditions=conditions,
        how="inner",
        sort_by_appearance=False,
        df_columns=None,
        right_columns=None,
        keep=keep,
        use_numba=use_numba,
        indicator=False,
        force=force,
        return_matching_indices=True,
    )

patterns(regex_pattern)

This function converts a string into a compiled regular expression.

It can be used to select columns in the index or column_names arguments of the pivot_longer function.

Warning

This function is deprecated. Kindly use re.compile instead.
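
A minimal sketch (not part of the rendered docstring) contrasting the deprecated helper with the recommended replacement; the regex is illustrative:

import re
from janitor.functions.utils import patterns

compiled = patterns(r"^hr\d$")  # emits a DeprecationWarning
compiled = re.compile(r"^hr\d$")  # preferred equivalent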

Parameters:

Name Type Description Default
regex_pattern Union[str, Pattern]

String to be converted to compiled regular expression.

required

Returns:

Type Description
Pattern

A compiled regular expression from the provided regex_pattern.

Source code in janitor/functions/utils.py
def patterns(regex_pattern: Union[str, Pattern]) -> Pattern:
    """This function converts a string into a compiled regular expression.

    It can be used to select columns in the index or columns_names
    arguments of `pivot_longer` function.

    !!!warning

        This function is deprecated. Kindly use `re.compile` instead.

    Args:
        regex_pattern: String to be converted to compiled regular
            expression.

    Returns:
        A compile regular expression from provided `regex_pattern`.
    """
    warnings.warn(
        "This function is deprecated. Kindly use `re.compile` instead.",
        DeprecationWarning,
        stacklevel=find_stack_level(),
    )
    check("regular expression", regex_pattern, [str, Pattern])

    return re.compile(regex_pattern)

pivot_longer(df, index=None, column_names=None, names_to=None, values_to='value', column_level=None, names_sep=None, names_pattern=None, names_transform=None, dropna=False, sort_by_appearance=False, ignore_index=True)

Unpivots a DataFrame from wide to long format.

This method does not mutate the original DataFrame.

It is modeled after the pivot_longer function in R's tidyr package, and also takes inspiration from R's data.table package.

This function is useful to massage a DataFrame into a format where one or more columns are considered measured variables, and all other columns are considered as identifier variables.

All measured variables are unpivoted (and typically duplicated) along the row axis.

Column selection in index and column_names is possible using the select syntax.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "Sepal.Length": [5.1, 5.9],
...         "Sepal.Width": [3.5, 3.0],
...         "Petal.Length": [1.4, 5.1],
...         "Petal.Width": [0.2, 1.8],
...         "Species": ["setosa", "virginica"],
...     }
... )
>>> df
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0           5.1          3.5           1.4          0.2     setosa
1           5.9          3.0           5.1          1.8  virginica

Replicate pandas' melt:

>>> df.pivot_longer(index = 'Species')
     Species      variable  value
0     setosa  Sepal.Length    5.1
1  virginica  Sepal.Length    5.9
2     setosa   Sepal.Width    3.5
3  virginica   Sepal.Width    3.0
4     setosa  Petal.Length    1.4
5  virginica  Petal.Length    5.1
6     setosa   Petal.Width    0.2
7  virginica   Petal.Width    1.8

Convenient, flexible column selection in the index via the select syntax:

>>> from pandas.api.types import is_string_dtype
>>> df.pivot_longer(index = is_string_dtype)
     Species      variable  value
0     setosa  Sepal.Length    5.1
1  virginica  Sepal.Length    5.9
2     setosa   Sepal.Width    3.5
3  virginica   Sepal.Width    3.0
4     setosa  Petal.Length    1.4
5  virginica  Petal.Length    5.1
6     setosa   Petal.Width    0.2
7  virginica   Petal.Width    1.8

Split the column labels into parts:

>>> df.pivot_longer(
...     index = 'Species',
...     names_to = ('part', 'dimension'),
...     names_sep = '.',
...     sort_by_appearance = True,
... )
     Species   part dimension  value
0     setosa  Sepal    Length    5.1
1     setosa  Sepal     Width    3.5
2     setosa  Petal    Length    1.4
3     setosa  Petal     Width    0.2
4  virginica  Sepal    Length    5.9
5  virginica  Sepal     Width    3.0
6  virginica  Petal    Length    5.1
7  virginica  Petal     Width    1.8

Retain parts of the column names as headers:

>>> df.pivot_longer(
...     index = 'Species',
...     names_to = ('part', '.value'),
...     names_sep = '.',
...     sort_by_appearance = True,
... )
     Species   part  Length  Width
0     setosa  Sepal     5.1    3.5
1     setosa  Petal     1.4    0.2
2  virginica  Sepal     5.9    3.0
3  virginica  Petal     5.1    1.8

Split the column labels based on regex:

>>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
>>> df
   id  new_sp_m5564  newrel_f65
0   1             2           3
>>> df.pivot_longer(
...     index = 'id',
...     names_to = ('diagnosis', 'gender', 'age'),
...     names_pattern = r"new_?(.+)_(.)(\d+)",
... )
   id diagnosis gender   age  value
0   1        sp      m  5564      2
1   1       rel      f    65      3

Split the column labels for the above dataframe using named groups in names_pattern:

>>> df.pivot_longer(
...     index = 'id',
...     names_pattern = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)",
... )
    id diagnosis gender   age  value
0   1        sp      m  5564      2
1   1       rel      f    65      3

Convert the dtypes of specific columns with names_transform:

>>> result = (df
...          .pivot_longer(
...              index = 'id',
...              names_to = ('diagnosis', 'gender', 'age'),
...              names_pattern = r"new_?(.+)_(.)(\d+)",
...              names_transform = {'gender': 'category', 'age':'int'})
... )
>>> result.dtypes
id           int64
diagnosis   object
gender    category
age          int64
value        int64
dtype: object

Use multiple .value to reshape dataframe:

>>> df = pd.DataFrame(
...     [
...         {
...             "x_1_mean": 10,
...             "x_2_mean": 20,
...             "y_1_mean": 30,
...             "y_2_mean": 40,
...             "unit": 50,
...         }
...     ]
... )
>>> df
   x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
0        10        20        30        40    50
>>> df.pivot_longer(
...     index="unit",
...     names_to=(".value", "time", ".value"),
...     names_pattern=r"(x|y)_([0-9])(_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40

Replicate the above with named groups in names_pattern - use _ instead of .value:

>>> df.pivot_longer(
...     index="unit",
...     names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40

Convenient, flexible column selection in the column_names via select syntax:

>>> df.pivot_longer(
...     column_names="*mean",
...     names_to=(".value", "time", ".value"),
...     names_pattern=r"(x|y)_([0-9])(_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40
>>> df.pivot_longer(
...     column_names=slice("x_1_mean", "y_2_mean"),
...     names_to=(".value", "time", ".value"),
...     names_pattern=r"(x|y)_([0-9])(_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40

Reshape dataframe by passing a sequence to names_pattern:

>>> df = pd.DataFrame({'hr1': [514, 573],
...                    'hr2': [545, 526],
...                    'team': ['Red Sox', 'Yankees'],
...                    'year1': [2007, 2007],
...                    'year2': [2008, 2008]})
>>> df
   hr1  hr2     team  year1  year2
0  514  545  Red Sox   2007   2008
1  573  526  Yankees   2007   2008
>>> df.pivot_longer(
...     index = 'team',
...     names_to = ['year', 'hr'],
...     names_pattern = ['year', 'hr']
... )
      team   hr  year
0  Red Sox  514  2007
1  Yankees  573  2007
2  Red Sox  545  2008
3  Yankees  526  2008

Reshape above dataframe by passing a dictionary to names_pattern:

>>> df.pivot_longer(
...     index = 'team',
...     names_pattern = {"year":"year", "hr":"hr"}
... )
      team   hr  year
0  Red Sox  514  2007
1  Yankees  573  2007
2  Red Sox  545  2008
3  Yankees  526  2008

Multiple values_to:

>>> df = pd.DataFrame(
...         {
...             "City": ["Houston", "Austin", "Hoover"],
...             "State": ["Texas", "Texas", "Alabama"],
...             "Name": ["Aria", "Penelope", "Niko"],
...             "Mango": [4, 10, 90],
...             "Orange": [10, 8, 14],
...             "Watermelon": [40, 99, 43],
...             "Gin": [16, 200, 34],
...             "Vodka": [20, 33, 18],
...         },
...         columns=[
...             "City",
...             "State",
...             "Name",
...             "Mango",
...             "Orange",
...             "Watermelon",
...             "Gin",
...             "Vodka",
...         ],
...     )
>>> df
      City    State      Name  Mango  Orange  Watermelon  Gin  Vodka
0  Houston    Texas      Aria      4      10          40   16     20
1   Austin    Texas  Penelope     10       8          99  200     33
2   Hoover  Alabama      Niko     90      14          43   34     18
>>> df.pivot_longer(
...         index=["City", "State"],
...         column_names=slice("Mango", "Vodka"),
...         names_to=("Fruit", "Drink"),
...         values_to=("Pounds", "Ounces"),
...         names_pattern=["M|O|W", "G|V"],
...     )
      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40   None     NaN
7   Austin    Texas  Watermelon      99   None     NaN
8   Hoover  Alabama  Watermelon      43   None     NaN

Replicate the above transformation with a nested dictionary passed to names_pattern - the outer keys in the names_pattern dictionary are passed to names_to, while the inner keys are passed to values_to:

>>> df.pivot_longer(
...     index=["City", "State"],
...     column_names=slice("Mango", "Vodka"),
...     names_pattern={
...         "Fruit": {"Pounds": "M|O|W"},
...         "Drink": {"Ounces": "G|V"},
...     },
... )
      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40   None     NaN
7   Austin    Texas  Watermelon      99   None     NaN
8   Hoover  Alabama  Watermelon      43   None     NaN
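
The dropna parameter removes rows with nulls in the values columns. As a hedged sketch building on the call above (not part of the original documentation; output omitted), adding dropna=True should drop the Watermelon rows, since their Ounces values are null:

>>> df.pivot_longer(
...     index=["City", "State"],
...     column_names=slice("Mango", "Vodka"),
...     names_to=("Fruit", "Drink"),
...     values_to=("Pounds", "Ounces"),
...     names_pattern=["M|O|W", "G|V"],
...     dropna=True,
... )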

Version Changed

  • 0.24.0
    • Added dropna parameter.
  • 0.24.1
    • names_pattern can accept a dictionary.
    • named groups supported in names_pattern.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
index Optional[Union[list, tuple, str, Pattern]]

Name(s) of columns to use as identifier variables. Should be either a single column name, or a list/tuple of column names. index should be a list of tuples if the columns are a MultiIndex.

None
column_names Optional[Union[list, tuple, str, Pattern]]

Name(s) of columns to unpivot. Should be either a single column name or a list/tuple of column names. column_names should be a list of tuples if the columns are a MultiIndex.

None
names_to Optional[Union[list, tuple, str]]

Name of the new column as a string that will contain what were previously the column names in column_names. The default is variable if no value is provided. It can also be a list/tuple of strings that will serve as new column names, if names_sep or names_pattern is provided. If .value is in names_to, new column names will be extracted from part of the existing column names, overriding values_to.

None
values_to Optional[str]

Name of new column as a string that will contain what were previously the values of the columns in column_names. values_to can also be a list/tuple and requires that names_pattern is also a list/tuple.

'value'
column_level Optional[Union[int, str]]

If columns are a MultiIndex, then use this level to unpivot the DataFrame. Provided for compatibility with pandas' melt, and applies only if neither names_sep nor names_pattern is provided.

None
names_sep Optional[Union[str, Pattern]]

Determines how the column name is broken up, if names_to contains multiple values. It takes the same specification as pandas' str.split method, and can be a string or regular expression. names_sep does not work with MultiIndex columns.

None
names_pattern Optional[Union[list, tuple, str, Pattern]]

Determines how the column name is broken up. It can be a regular expression containing matching groups. Under the hood it is processed with pandas' str.extract function. If it is a single regex, the number of groups must match the length of names_to. Named groups are supported if names_to is None; _ is used instead of .value as a placeholder in named groups, and can be repeated for multiple .value calls as _, __, ___, and so on. names_pattern can also be a list/tuple of regular expressions, or a list/tuple of strings that will be treated as regular expressions; these are processed under the hood with pandas' str.contains function. For a list/tuple of regular expressions, names_to must also be a list/tuple and the lengths of both arguments must match. names_pattern can also be a dictionary, where the keys are the new column names and the values are regular expressions (or strings evaluated as regular expressions). Alternatively, a nested dictionary can be used, where the sub key(s) are associated with values_to. Please have a look at the examples for usage. names_pattern does not work with MultiIndex columns.

None
names_transform Optional[Union[str, Callable, dict]]

Use this option to change the types of columns that have been transformed to rows. This does not apply to the values columns. Accepts any argument that is acceptable by pd.DataFrame.astype.

None
dropna bool

Determines whether or not to drop nulls from the values columns. Default is False.

False
sort_by_appearance Optional[bool]

Boolean value that determines the final look of the DataFrame. If True, the unpivoted DataFrame will be stacked in order of first appearance.

False
ignore_index Optional[bool]

If True, the original index is ignored. If False, the original index is retained and the index labels will be repeated as necessary.

True

Returns:

Type Description
DataFrame

A pandas DataFrame that has been unpivoted from wide to long format.

Source code in janitor/functions/pivot.py
@pf.register_dataframe_method
def pivot_longer(
    df: pd.DataFrame,
    index: Optional[Union[list, tuple, str, Pattern]] = None,
    column_names: Optional[Union[list, tuple, str, Pattern]] = None,
    names_to: Optional[Union[list, tuple, str]] = None,
    values_to: Optional[str] = "value",
    column_level: Optional[Union[int, str]] = None,
    names_sep: Optional[Union[str, Pattern]] = None,
    names_pattern: Optional[Union[list, tuple, str, Pattern]] = None,
    names_transform: Optional[Union[str, Callable, dict]] = None,
    dropna: bool = False,
    sort_by_appearance: Optional[bool] = False,
    ignore_index: Optional[bool] = True,
) -> pd.DataFrame:
    """Unpivots a DataFrame from *wide* to *long* format.

    This method does not mutate the original DataFrame.

    It is modeled after the `pivot_longer` function in R's tidyr package,
    and also takes inspiration from R's data.table package.

    This function is useful to massage a DataFrame into a format where
    one or more columns are considered measured variables, and all other
    columns are considered as identifier variables.

    All measured variables are *unpivoted* (and typically duplicated) along the
    row axis.

    Column selection in `index` and `column_names` is possible using the
    [`select`][janitor.functions.select.select] syntax.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "Sepal.Length": [5.1, 5.9],
        ...         "Sepal.Width": [3.5, 3.0],
        ...         "Petal.Length": [1.4, 5.1],
        ...         "Petal.Width": [0.2, 1.8],
        ...         "Species": ["setosa", "virginica"],
        ...     }
        ... )
        >>> df
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
        0           5.1          3.5           1.4          0.2     setosa
        1           5.9          3.0           5.1          1.8  virginica

        Replicate pandas' melt:
        >>> df.pivot_longer(index = 'Species')
             Species      variable  value
        0     setosa  Sepal.Length    5.1
        1  virginica  Sepal.Length    5.9
        2     setosa   Sepal.Width    3.5
        3  virginica   Sepal.Width    3.0
        4     setosa  Petal.Length    1.4
        5  virginica  Petal.Length    5.1
        6     setosa   Petal.Width    0.2
        7  virginica   Petal.Width    1.8

        Convenient, flexible column selection in the `index` via the
        [`select`][janitor.functions.select.select] syntax:
        >>> from pandas.api.types import is_string_dtype
        >>> df.pivot_longer(index = is_string_dtype)
             Species      variable  value
        0     setosa  Sepal.Length    5.1
        1  virginica  Sepal.Length    5.9
        2     setosa   Sepal.Width    3.5
        3  virginica   Sepal.Width    3.0
        4     setosa  Petal.Length    1.4
        5  virginica  Petal.Length    5.1
        6     setosa   Petal.Width    0.2
        7  virginica   Petal.Width    1.8

        Split the column labels into parts:
        >>> df.pivot_longer(
        ...     index = 'Species',
        ...     names_to = ('part', 'dimension'),
        ...     names_sep = '.',
        ...     sort_by_appearance = True,
        ... )
             Species   part dimension  value
        0     setosa  Sepal    Length    5.1
        1     setosa  Sepal     Width    3.5
        2     setosa  Petal    Length    1.4
        3     setosa  Petal     Width    0.2
        4  virginica  Sepal    Length    5.9
        5  virginica  Sepal     Width    3.0
        6  virginica  Petal    Length    5.1
        7  virginica  Petal     Width    1.8

        Retain parts of the column names as headers:
        >>> df.pivot_longer(
        ...     index = 'Species',
        ...     names_to = ('part', '.value'),
        ...     names_sep = '.',
        ...     sort_by_appearance = True,
        ... )
             Species   part  Length  Width
        0     setosa  Sepal     5.1    3.5
        1     setosa  Petal     1.4    0.2
        2  virginica  Sepal     5.9    3.0
        3  virginica  Petal     5.1    1.8

        Split the column labels based on regex:
        >>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
        >>> df
           id  new_sp_m5564  newrel_f65
        0   1             2           3
        >>> df.pivot_longer(
        ...     index = 'id',
        ...     names_to = ('diagnosis', 'gender', 'age'),
        ...     names_pattern = r"new_?(.+)_(.)(\\d+)",
        ... )
           id diagnosis gender   age  value
        0   1        sp      m  5564      2
        1   1       rel      f    65      3

        Split the column labels for the above dataframe using named groups in `names_pattern`:
        >>> df.pivot_longer(
        ...     index = 'id',
        ...     names_pattern = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\\d+)",
        ... )
            id diagnosis gender   age  value
        0   1        sp      m  5564      2
        1   1       rel      f    65      3

        Convert the dtypes of specific columns with `names_transform`:
        >>> result = (df
        ...          .pivot_longer(
        ...              index = 'id',
        ...              names_to = ('diagnosis', 'gender', 'age'),
        ...              names_pattern = r"new_?(.+)_(.)(\\d+)",
        ...              names_transform = {'gender': 'category', 'age':'int'})
        ... )
        >>> result.dtypes
        id           int64
        diagnosis   object
        gender    category
        age          int64
        value        int64
        dtype: object

        Use multiple `.value` to reshape dataframe:
        >>> df = pd.DataFrame(
        ...     [
        ...         {
        ...             "x_1_mean": 10,
        ...             "x_2_mean": 20,
        ...             "y_1_mean": 30,
        ...             "y_2_mean": 40,
        ...             "unit": 50,
        ...         }
        ...     ]
        ... )
        >>> df
           x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
        0        10        20        30        40    50
        >>> df.pivot_longer(
        ...     index="unit",
        ...     names_to=(".value", "time", ".value"),
        ...     names_pattern=r"(x|y)_([0-9])(_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        Replicate the above with named groups in `names_pattern` - use `_` instead of `.value`:
        >>> df.pivot_longer(
        ...     index="unit",
        ...     names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        Convenient, flexible column selection in the `column_names` via
        [`select`][janitor.functions.select.select] syntax:
        >>> df.pivot_longer(
        ...     column_names="*mean",
        ...     names_to=(".value", "time", ".value"),
        ...     names_pattern=r"(x|y)_([0-9])(_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        >>> df.pivot_longer(
        ...     column_names=slice("x_1_mean", "y_2_mean"),
        ...     names_to=(".value", "time", ".value"),
        ...     names_pattern=r"(x|y)_([0-9])(_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        Reshape dataframe by passing a sequence to `names_pattern`:
        >>> df = pd.DataFrame({'hr1': [514, 573],
        ...                    'hr2': [545, 526],
        ...                    'team': ['Red Sox', 'Yankees'],
        ...                    'year1': [2007, 2007],
        ...                    'year2': [2008, 2008]})
        >>> df
           hr1  hr2     team  year1  year2
        0  514  545  Red Sox   2007   2008
        1  573  526  Yankees   2007   2008
        >>> df.pivot_longer(
        ...     index = 'team',
        ...     names_to = ['year', 'hr'],
        ...     names_pattern = ['year', 'hr']
        ... )
              team   hr  year
        0  Red Sox  514  2007
        1  Yankees  573  2007
        2  Red Sox  545  2008
        3  Yankees  526  2008


        Reshape above dataframe by passing a dictionary to `names_pattern`:
        >>> df.pivot_longer(
        ...     index = 'team',
        ...     names_pattern = {"year":"year", "hr":"hr"}
        ... )
              team   hr  year
        0  Red Sox  514  2007
        1  Yankees  573  2007
        2  Red Sox  545  2008
        3  Yankees  526  2008

        Multiple values_to:
        >>> df = pd.DataFrame(
        ...         {
        ...             "City": ["Houston", "Austin", "Hoover"],
        ...             "State": ["Texas", "Texas", "Alabama"],
        ...             "Name": ["Aria", "Penelope", "Niko"],
        ...             "Mango": [4, 10, 90],
        ...             "Orange": [10, 8, 14],
        ...             "Watermelon": [40, 99, 43],
        ...             "Gin": [16, 200, 34],
        ...             "Vodka": [20, 33, 18],
        ...         },
        ...         columns=[
        ...             "City",
        ...             "State",
        ...             "Name",
        ...             "Mango",
        ...             "Orange",
        ...             "Watermelon",
        ...             "Gin",
        ...             "Vodka",
        ...         ],
        ...     )
        >>> df
              City    State      Name  Mango  Orange  Watermelon  Gin  Vodka
        0  Houston    Texas      Aria      4      10          40   16     20
        1   Austin    Texas  Penelope     10       8          99  200     33
        2   Hoover  Alabama      Niko     90      14          43   34     18
        >>> df.pivot_longer(
        ...         index=["City", "State"],
        ...         column_names=slice("Mango", "Vodka"),
        ...         names_to=("Fruit", "Drink"),
        ...         values_to=("Pounds", "Ounces"),
        ...         names_pattern=["M|O|W", "G|V"],
        ...     )
              City    State       Fruit  Pounds  Drink  Ounces
        0  Houston    Texas       Mango       4    Gin    16.0
        1   Austin    Texas       Mango      10    Gin   200.0
        2   Hoover  Alabama       Mango      90    Gin    34.0
        3  Houston    Texas      Orange      10  Vodka    20.0
        4   Austin    Texas      Orange       8  Vodka    33.0
        5   Hoover  Alabama      Orange      14  Vodka    18.0
        6  Houston    Texas  Watermelon      40   None     NaN
        7   Austin    Texas  Watermelon      99   None     NaN
        8   Hoover  Alabama  Watermelon      43   None     NaN

        Replicate the above transformation with a nested dictionary passed to `names_pattern`
        - the outer keys in the `names_pattern` dictionary are passed to `names_to`,
        while the inner keys are passed to `values_to`:
        >>> df.pivot_longer(
        ...     index=["City", "State"],
        ...     column_names=slice("Mango", "Vodka"),
        ...     names_pattern={
        ...         "Fruit": {"Pounds": "M|O|W"},
        ...         "Drink": {"Ounces": "G|V"},
        ...     },
        ... )
              City    State       Fruit  Pounds  Drink  Ounces
        0  Houston    Texas       Mango       4    Gin    16.0
        1   Austin    Texas       Mango      10    Gin   200.0
        2   Hoover  Alabama       Mango      90    Gin    34.0
        3  Houston    Texas      Orange      10  Vodka    20.0
        4   Austin    Texas      Orange       8  Vodka    33.0
        5   Hoover  Alabama      Orange      14  Vodka    18.0
        6  Houston    Texas  Watermelon      40   None     NaN
        7   Austin    Texas  Watermelon      99   None     NaN
        8   Hoover  Alabama  Watermelon      43   None     NaN

    !!! abstract "Version Changed"

        - 0.24.0
            - Added `dropna` parameter.
        - 0.24.1
            - `names_pattern` can accept a dictionary.
            - named groups supported in `names_pattern`.

    Args:
        df: A pandas DataFrame.
        index: Name(s) of columns to use as identifier variables.
            Should be either a single column name, or a list/tuple of
            column names.
            `index` should be a list of tuples if the columns are a MultiIndex.
        column_names: Name(s) of columns to unpivot. Should be either
            a single column name or a list/tuple of column names.
            `column_names` should be a list of tuples
            if the columns are a MultiIndex.
        names_to: Name of new column as a string that will contain
            what were previously the column names in `column_names`.
            The default is `variable` if no value is provided. It can
            also be a list/tuple of strings that will serve as new column
            names, if `names_sep` or `names_pattern` is provided.
            If `.value` is in `names_to`, new column names will be extracted
            from part of the existing column names, overriding `values_to`.
        values_to: Name of new column as a string that will contain what
            were previously the values of the columns in `column_names`.
            values_to can also be a list/tuple
            and requires that names_pattern is also a list/tuple.
        column_level: If columns are a MultiIndex, then use this level to
            unpivot the DataFrame. Provided for compatibility with pandas' melt,
            and applies only if neither `names_sep` nor `names_pattern` is
            provided.
        names_sep: Determines how the column name is broken up, if
            `names_to` contains multiple values. It takes the same
            specification as pandas' `str.split` method, and can be a string
            or regular expression. `names_sep` does not work with MultiIndex
            columns.
        names_pattern: Determines how the column name is broken up.
            It can be a regular expression containing matching groups.
            Under the hood it is processed with pandas' `str.extract` function.
            If it is a single regex, the number of groups must match
            the length of `names_to`.
            Named groups are supported, if `names_to` is None. `_` is used
            instead of `.value` as a placeholder in named groups.
            `_` can be overloaded for multiple `.value`
            calls - `_`, `__`, `___`, ...
            `names_pattern` can also be a list/tuple of regular expressions,
            or a list/tuple of strings;
            the strings will be treated as regular expressions.
            Under the hood it is processed with pandas' `str.contains` function.
            For a list/tuple of regular expressions,
            `names_to` must also be a list/tuple and the lengths of both
            arguments must match.
            `names_pattern` can also be a dictionary, where the keys are
            the new column names, while the values can be a regular expression
            or a string which will be evaluated as a regular expression.
            Alternatively, a nested dictionary can be used, where the sub
            key(s) are associated with `values_to`. Please have a look
            at the examples for usage.
            `names_pattern` does not work with MultiIndex columns.
        names_transform: Use this option to change the types of columns that
            have been transformed to rows. This does not apply to the values columns.
            Accepts any argument that is acceptable by `pd.DataFrame.astype`.
        dropna: Determines whether or not to drop nulls
            from the values columns. Default is `False`.
        sort_by_appearance: Boolean value that determines
            the final look of the DataFrame. If `True`, the unpivoted DataFrame
            will be stacked in order of first appearance.
        ignore_index: If `True`,
            the original index is ignored. If `False`, the original index
            is retained and the index labels will be repeated as necessary.

    Returns:
        A pandas DataFrame that has been unpivoted from wide to long
            format.
    """  # noqa: E501

    # this code builds on the wonderful work of @benjaminjack’s PR
    # https://github.com/benjaminjack/pyjanitor/commit/e3df817903c20dd21634461c8a92aec137963ed0

    return _computations_pivot_longer(
        df=df,
        index=index,
        column_names=column_names,
        column_level=column_level,
        names_to=names_to,
        values_to=values_to,
        names_sep=names_sep,
        names_pattern=names_pattern,
        names_transform=names_transform,
        dropna=dropna,
        sort_by_appearance=sort_by_appearance,
        ignore_index=ignore_index,
    )

pivot_wider(df, index=None, names_from=None, values_from=None, flatten_levels=True, names_sep='_', names_glue=None, reset_index=True, names_expand=False, index_expand=False)

Reshapes data from long to wide form.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.pivot instead.

The number of columns is increased, while the number of rows decreases. It is the inverse of the pivot_longer method, and is a wrapper around the pd.DataFrame.pivot method.

This method does not mutate the original DataFrame.

Column selection in index, names_from and values_from is possible using the select syntax.

A ValueError is raised if the combination of the index and names_from is not unique.

By default, values from values_from are always at the top level if the columns are not flattened. If flattened, the values from values_from are usually at the start of each label in the columns.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = [{'dep': 5.5, 'step': 1, 'a': 20, 'b': 30},
...       {'dep': 5.5, 'step': 2, 'a': 25, 'b': 37},
...       {'dep': 6.1, 'step': 1, 'a': 22, 'b': 19},
...       {'dep': 6.1, 'step': 2, 'a': 18, 'b': 29}]
>>> df = pd.DataFrame(df)
>>> df
   dep  step   a   b
0  5.5     1  20  30
1  5.5     2  25  37
2  6.1     1  22  19
3  6.1     2  18  29

Pivot and flatten columns:

>>> df.pivot_wider(
...     index = "dep",
...     names_from = "step",
... )
   dep  a_1  a_2  b_1  b_2
0  5.5   20   25   30   37
1  6.1   22   18   19   29
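
Since pivot_wider is slated for deprecation, the same reshape can be written with pandas directly. A minimal, hedged sketch (out is just an illustrative name; the column flattening that pivot_wider performs is done by hand) which should reproduce the frame above:

>>> out = df.pivot(index="dep", columns="step", values=["a", "b"])
>>> out.columns = [f"{value}_{step}" for value, step in out.columns]
>>> out.reset_index()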

Modify columns with names_sep:

>>> df.pivot_wider(
...     index = "dep",
...     names_from = "step",
...     names_sep = "",
... )
   dep  a1  a2  b1  b2
0  5.5  20  25  30  37
1  6.1  22  18  19  29

Modify columns with names_glue:

>>> df.pivot_wider(
...     index = "dep",
...     names_from = "step",
...     names_glue = "{_value}_step{step}",
... )
   dep  a_step1  a_step2  b_step1  b_step2
0  5.5       20       25       30       37
1  6.1       22       18       19       29

Expand columns to expose implicit missing values - this applies only to categorical columns:

>>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
>>> daily = pd.DataFrame(
...     {
...         "day": pd.Categorical(
...             values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays
...         ),
...         "value": (2, 3, 1, 5),
...     },
... index=[0, 0, 0, 0],
... )
>>> daily
   day  value
0  Tue      2
0  Thu      3
0  Fri      1
0  Mon      5
>>> daily.pivot_wider(names_from='day', values_from='value')
   Tue  Thu  Fri  Mon
0    2    3    1    5
>>> (daily
... .pivot_wider(
...     names_from='day',
...     values_from='value',
...     names_expand=True)
... )
   Mon  Tue  Wed  Thu  Fri  Sat  Sun
0    5    2  NaN    3    1  NaN  NaN

Expand the index to expose implicit missing values - this applies only to categorical columns:

>>> daily = daily.assign(letter = list('ABBA'))
>>> daily
   day  value letter
0  Tue      2      A
0  Thu      3      B
0  Fri      1      B
0  Mon      5      A
>>> daily.pivot_wider(index='day',names_from='letter',values_from='value')
   day    A    B
0  Tue  2.0  NaN
1  Thu  NaN  3.0
2  Fri  NaN  1.0
3  Mon  5.0  NaN
>>> (daily
... .pivot_wider(
...     index='day',
...     names_from='letter',
...     values_from='value',
...     index_expand=True)
... )
   day    A    B
0  Mon  5.0  NaN
1  Tue  2.0  NaN
2  Wed  NaN  NaN
3  Thu  NaN  3.0
4  Fri  NaN  1.0
5  Sat  NaN  NaN
6  Sun  NaN  NaN

Version Changed

  • 0.24.0
    • Added reset_index, names_expand and index_expand parameters.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
index Optional[Union[list, str]]

Name(s) of columns to use as identifier variables. It should be either a single column name, or a list of column names. If index is not provided, the DataFrame's index is used.

None
names_from Optional[Union[list, str]]

Name(s) of column(s) to use to make the new DataFrame's columns. Should be either a single column name, or a list of column names.

None
values_from Optional[Union[list, str]]

Name(s) of column(s) that will be used for populating the new DataFrame's values. If values_from is not specified, all remaining columns will be used.

None
flatten_levels Optional[bool]

If False, the DataFrame stays as a MultiIndex.

True
names_sep str

If names_from or values_from contain multiple variables, this will be used to join the values into a single string to use as a column name. Default is _. Applicable only if flatten_levels is True.

'_'
names_glue str

A string to control the output of the flattened columns. It offers more flexibility in creating custom column names, and uses python's str.format_map under the hood. Simply create the string template, using the column labels in names_from, and special _value as a placeholder for values_from. Applicable only if flatten_levels is True.

None
reset_index bool

Determines whether to restore index as a column/columns. Applicable only if index is provided, and flatten_levels is True.

True
names_expand bool

Expand columns to show all the categories. Applies only if names_from is a categorical column.

False
index_expand bool

Expand the index to show all the categories. Applies only if index is a categorical column.

False

Returns:

Type Description
DataFrame

A pandas DataFrame that has been pivoted from long to wide form.

Source code in janitor/functions/pivot.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.pivot` instead."
    )
)
def pivot_wider(
    df: pd.DataFrame,
    index: Optional[Union[list, str]] = None,
    names_from: Optional[Union[list, str]] = None,
    values_from: Optional[Union[list, str]] = None,
    flatten_levels: Optional[bool] = True,
    names_sep: str = "_",
    names_glue: str = None,
    reset_index: bool = True,
    names_expand: bool = False,
    index_expand: bool = False,
) -> pd.DataFrame:
    """Reshapes data from *long* to *wide* form.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.pivot` instead.

    The number of columns is increased, while the number
    of rows decreases. It is the inverse of the
    [`pivot_longer`][janitor.functions.pivot.pivot_longer]
    method, and is a wrapper around `pd.DataFrame.pivot` method.

    This method does not mutate the original DataFrame.

    Column selection in `index`, `names_from` and `values_from`
    is possible using the
    [`select`][janitor.functions.select.select] syntax.

    A ValueError is raised if the combination
    of the `index` and `names_from` is not unique.

    By default, values from `values_from` are always
    at the top level if the columns are not flattened.
    If flattened, the values from `values_from` are usually
    at the start of each label in the columns.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = [{'dep': 5.5, 'step': 1, 'a': 20, 'b': 30},
        ...       {'dep': 5.5, 'step': 2, 'a': 25, 'b': 37},
        ...       {'dep': 6.1, 'step': 1, 'a': 22, 'b': 19},
        ...       {'dep': 6.1, 'step': 2, 'a': 18, 'b': 29}]
        >>> df = pd.DataFrame(df)
        >>> df
           dep  step   a   b
        0  5.5     1  20  30
        1  5.5     2  25  37
        2  6.1     1  22  19
        3  6.1     2  18  29

        Pivot and flatten columns:
        >>> df.pivot_wider( # doctest: +SKIP
        ...     index = "dep",
        ...     names_from = "step",
        ... )
           dep  a_1  a_2  b_1  b_2
        0  5.5   20   25   30   37
        1  6.1   22   18   19   29

        Modify columns with `names_sep`:
        >>> df.pivot_wider( # doctest: +SKIP
        ...     index = "dep",
        ...     names_from = "step",
        ...     names_sep = "",
        ... )
           dep  a1  a2  b1  b2
        0  5.5  20  25  30  37
        1  6.1  22  18  19  29

        Modify columns with `names_glue`:
        >>> df.pivot_wider( # doctest: +SKIP
        ...     index = "dep",
        ...     names_from = "step",
        ...     names_glue = "{_value}_step{step}",
        ... )
           dep  a_step1  a_step2  b_step1  b_step2
        0  5.5       20       25       30       37
        1  6.1       22       18       19       29

        Expand columns to expose implicit missing values
        - this applies only to categorical columns:
        >>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
        >>> daily = pd.DataFrame(
        ...     {
        ...         "day": pd.Categorical(
        ...             values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays
        ...         ),
        ...         "value": (2, 3, 1, 5),
        ...     },
        ... index=[0, 0, 0, 0],
        ... )
        >>> daily
           day  value
        0  Tue      2
        0  Thu      3
        0  Fri      1
        0  Mon      5
        >>> daily.pivot_wider(names_from='day', values_from='value') # doctest: +SKIP
           Tue  Thu  Fri  Mon
        0    2    3    1    5
        >>> (daily # doctest: +SKIP
        ... .pivot_wider(
        ...     names_from='day',
        ...     values_from='value',
        ...     names_expand=True)
        ... )
           Mon  Tue  Wed  Thu  Fri  Sat  Sun
        0    5    2  NaN    3    1  NaN  NaN

        Expand the index to expose implicit missing values
        - this applies only to categorical columns:
        >>> daily = daily.assign(letter = list('ABBA'))
        >>> daily
           day  value letter
        0  Tue      2      A
        0  Thu      3      B
        0  Fri      1      B
        0  Mon      5      A
        >>> daily.pivot_wider(index='day',names_from='letter',values_from='value') # doctest: +SKIP
           day    A    B
        0  Tue  2.0  NaN
        1  Thu  NaN  3.0
        2  Fri  NaN  1.0
        3  Mon  5.0  NaN
        >>> (daily # doctest: +SKIP
        ... .pivot_wider(
        ...     index='day',
        ...     names_from='letter',
        ...     values_from='value',
        ...     index_expand=True)
        ... )
           day    A    B
        0  Mon  5.0  NaN
        1  Tue  2.0  NaN
        2  Wed  NaN  NaN
        3  Thu  NaN  3.0
        4  Fri  NaN  1.0
        5  Sat  NaN  NaN
        6  Sun  NaN  NaN


    !!! abstract "Version Changed"

        - 0.24.0
            - Added `reset_index`, `names_expand` and `index_expand` parameters.

    Args:
        df: A pandas DataFrame.
        index: Name(s) of columns to use as identifier variables.
            It should be either a single column name, or a list of column names.
            If `index` is not provided, the DataFrame's index is used.
        names_from: Name(s) of column(s) to use to make the new
            DataFrame's columns. Should be either a single column name,
            or a list of column names.
        values_from: Name(s) of column(s) that will be used for populating
            the new DataFrame's values.
            If `values_from` is not specified,  all remaining columns
            will be used.
        flatten_levels: If `False`, the DataFrame stays as a MultiIndex.
        names_sep: If `names_from` or `values_from` contain multiple
            variables, this will be used to join the values into a single string
            to use as a column name. Default is `_`.
            Applicable only if `flatten_levels` is `True`.
        names_glue: A string to control the output of the flattened columns.
            It offers more flexibility in creating custom column names,
            and uses python's `str.format_map` under the hood.
            Simply create the string template,
            using the column labels in `names_from`,
            and special `_value` as a placeholder for `values_from`.
            Applicable only if `flatten_levels` is `True`.
        reset_index: Determines whether to restore `index`
            as a column/columns. Applicable only if `index` is provided,
            and `flatten_levels` is `True`.
        names_expand: Expand columns to show all the categories.
            Applies only if `names_from` is a categorical column.
        index_expand: Expand the index to show all the categories.
            Applies only if `index` is a categorical column.

    Returns:
        A pandas DataFrame that has been pivoted from long to wide form.
    """  # noqa: E501

    # no need for an explicit copy --> df = df.copy()
    # `pd.pivot` creates one
    return _computations_pivot_wider(
        df,
        index,
        names_from,
        values_from,
        flatten_levels,
        names_sep,
        names_glue,
        reset_index,
        names_expand,
        index_expand,
    )

rename_column(df, old_column_name, new_column_name)

Rename a column.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.rename instead.

This is just syntactic sugar/a convenience function for renaming one column at a time. If you need to rename multiple columns, use the pandas.DataFrame.rename method instead.

Examples:

Change the name of column 'a' to 'a_new'.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.rename_column(old_column_name='a', new_column_name='a_new')
   a_new  b
0      0  a
1      1  b
2      2  c
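
Given the planned deprecation, the same rename can be written with pandas directly; since rename_column simply wraps pd.DataFrame.rename (see the source below), the result is the same frame:

>>> df.rename(columns={'a': 'a_new'})
   a_new  b
0      0  a
1      1  b
2      2  c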

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
old_column_name str

The old column name.

required
new_column_name str

The new column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with renamed columns.

Source code in janitor/functions/rename_columns.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.rename` instead."
    )
)
@deprecated_alias(old="old_column_name", new="new_column_name")
def rename_column(
    df: pd.DataFrame,
    old_column_name: str,
    new_column_name: str,
) -> pd.DataFrame:
    """Rename a column in place.

    This method does not mutate the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.rename` instead.

    This is just syntactic sugar/a convenience function for renaming one column at a time.
    If you are convinced that there are multiple columns in need of changing,
    then use the `pandas.DataFrame.rename` method.

    Examples:
        Change the name of column 'a' to 'a_new'.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
        >>> df.rename_column(old_column_name='a', new_column_name='a_new')
           a_new  b
        0      0  a
        1      1  b
        2      2  c

    Args:
        df: The pandas DataFrame object.
        old_column_name: The old column name.
        new_column_name: The new column name.

    Returns:
        A pandas DataFrame with renamed columns.
    """  # noqa: E501

    check_column(df, [old_column_name])

    return df.rename(columns={old_column_name: new_column_name})

select_columns(df, *args, invert=False)

Method-chainable selection of columns.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

The selection of columns can optionally be inverted via the invert parameter.

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods. select_columns is primarily for convenience.

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> from numpy import nan
>>> pd.set_option("display.max_columns", None)
>>> pd.set_option("display.expand_frame_repr", False)
>>> pd.set_option("max_colwidth", None)
>>> data = {'name': ['Cheetah','Owl monkey','Mountain beaver',
...                  'Greater short-tailed shrew','Cow'],
...         'genus': ['Acinonyx', 'Aotus', 'Aplodontia', 'Blarina', 'Bos'],
...         'vore': ['carni', 'omni', 'herbi', 'omni', 'herbi'],
...         'order': ['Carnivora','Primates','Rodentia','Soricomorpha','Artiodactyla'],
...         'conservation': ['lc', nan, 'nt', 'lc', 'domesticated'],
...         'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
...         'sleep_rem': [nan, 1.8, 2.4, 2.3, 0.7],
...         'sleep_cycle': [nan, nan, nan, 0.133333333, 0.666666667],
...         'awake': [11.9, 7.0, 9.6, 9.1, 20.0],
...         'brainwt': [nan, 0.0155, nan, 0.00029, 0.423],
...         'bodywt': [50.0, 0.48, 1.35, 0.019, 600.0]}
>>> df = pd.DataFrame(data)
>>> df
                         name       genus   vore         order  conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
0                     Cheetah    Acinonyx  carni     Carnivora            lc         12.1        NaN          NaN   11.9      NaN   50.000
1                  Owl monkey       Aotus   omni      Primates           NaN         17.0        1.8          NaN    7.0  0.01550    0.480
2             Mountain beaver  Aplodontia  herbi      Rodentia            nt         14.4        2.4          NaN    9.6      NaN    1.350
3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc         14.9        2.3     0.133333    9.1  0.00029    0.019
4                         Cow         Bos  herbi  Artiodactyla  domesticated          4.0        0.7     0.666667   20.0  0.42300  600.000

Explicit label selection:

>>> df.select_columns('name', 'order')
                         name         order
0                     Cheetah     Carnivora
1                  Owl monkey      Primates
2             Mountain beaver      Rodentia
3  Greater short-tailed shrew  Soricomorpha
4                         Cow  Artiodactyla
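
As the note above mentions, plain .loc indexing covers the simple cases; the same selection written with pandas directly (a hedged equivalent added for comparison):

>>> df.loc[:, ['name', 'order']]
                         name         order
0                     Cheetah     Carnivora
1                  Owl monkey      Primates
2             Mountain beaver      Rodentia
3  Greater short-tailed shrew  Soricomorpha
4                         Cow  Artiodactyla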

Selection via globbing:

>>> df.select_columns("sleep*", "*wt")
   sleep_total  sleep_rem  sleep_cycle  brainwt   bodywt
0         12.1        NaN          NaN      NaN   50.000
1         17.0        1.8          NaN  0.01550    0.480
2         14.4        2.4          NaN      NaN    1.350
3         14.9        2.3     0.133333  0.00029    0.019
4          4.0        0.7     0.666667  0.42300  600.000

Selection via regex:

>>> import re
>>> df.select_columns(re.compile(r"o.+er"))
          order  conservation
0     Carnivora            lc
1      Primates           NaN
2      Rodentia            nt
3  Soricomorpha            lc
4  Artiodactyla  domesticated

Selection via slicing:

>>> df.select_columns(slice('name','order'), slice('sleep_total','sleep_cycle'))
                         name       genus   vore         order  sleep_total  sleep_rem  sleep_cycle
0                     Cheetah    Acinonyx  carni     Carnivora         12.1        NaN          NaN
1                  Owl monkey       Aotus   omni      Primates         17.0        1.8          NaN
2             Mountain beaver  Aplodontia  herbi      Rodentia         14.4        2.4          NaN
3  Greater short-tailed shrew     Blarina   omni  Soricomorpha         14.9        2.3     0.133333
4                         Cow         Bos  herbi  Artiodactyla          4.0        0.7     0.666667

Selection via callable:

>>> from pandas.api.types import is_numeric_dtype
>>> df.select_columns(is_numeric_dtype)
   sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
0         12.1        NaN          NaN   11.9      NaN   50.000
1         17.0        1.8          NaN    7.0  0.01550    0.480
2         14.4        2.4          NaN    9.6      NaN    1.350
3         14.9        2.3     0.133333    9.1  0.00029    0.019
4          4.0        0.7     0.666667   20.0  0.42300  600.000
>>> df.select_columns(lambda f: f.isna().any())
   conservation  sleep_rem  sleep_cycle  brainwt
0            lc        NaN          NaN      NaN
1           NaN        1.8          NaN  0.01550
2            nt        2.4          NaN      NaN
3            lc        2.3     0.133333  0.00029
4  domesticated        0.7     0.666667  0.42300

Exclude columns with the invert parameter:

>>> df.select_columns(is_numeric_dtype, invert=True)
                         name       genus   vore         order  conservation
0                     Cheetah    Acinonyx  carni     Carnivora            lc
1                  Owl monkey       Aotus   omni      Primates           NaN
2             Mountain beaver  Aplodontia  herbi      Rodentia            nt
3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc
4                         Cow         Bos  herbi  Artiodactyla  domesticated

Exclude columns with the DropLabel class:

>>> from janitor import DropLabel
>>> df.select_columns(DropLabel(slice("name", "awake")), "conservation")
   brainwt   bodywt  conservation
0      NaN   50.000            lc
1  0.01550    0.480           NaN
2      NaN    1.350            nt
3  0.00029    0.019            lc
4  0.42300  600.000  domesticated

Selection on MultiIndex columns:

>>> d = {'num_legs': [4, 4, 2, 2],
...      'num_wings': [0, 0, 2, 2],
...      'class': ['mammal', 'mammal', 'mammal', 'bird'],
...      'animal': ['cat', 'dog', 'bat', 'penguin'],
...      'locomotion': ['walks', 'walks', 'flies', 'walks']}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(['class', 'animal', 'locomotion']).T
>>> df
class      mammal                bird
animal        cat   dog   bat penguin
locomotion  walks walks flies   walks
num_legs        4     4     2       2
num_wings       0     0     2       2

Selection with a scalar:

>>> df.select_columns('mammal')
class      mammal
animal        cat   dog   bat
locomotion  walks walks flies
num_legs        4     4     2
num_wings       0     0     2

Selection with a tuple:

>>> df.select_columns(('mammal','bat'))
class      mammal
animal        bat
locomotion  flies
num_legs        2
num_wings       2

Selection within a level is possible with a dictionary, where the key is either a level name or number:

>>> df.select_columns({'animal':'cat'})
class      mammal
animal        cat
locomotion  walks
num_legs        4
num_wings       0
>>> df.select_columns({1:["bat", "cat"]})
class      mammal
animal        bat   cat
locomotion  flies walks
num_legs        2     4
num_wings       2     0

Selection on multiple levels:

>>> df.select_columns({"class":"mammal", "locomotion":"flies"})
class      mammal
animal        bat
locomotion  flies
num_legs        2
num_wings       2

Selection with a regex on a level:

>>> df.select_columns({"animal":re.compile(".+t$")})
class      mammal
animal        cat   bat
locomotion  walks flies
num_legs        4     2
num_wings       0     2

Selection with a callable on a level:

>>> df.select_columns({"animal":lambda f: f.str.endswith('t')})
class      mammal
animal        cat   bat
locomotion  walks flies
num_legs        4     2
num_wings       0     2

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*args Any

Valid inputs include: an exact column name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the columns provided.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with the specified columns selected.

Source code in janitor/functions/select.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.select` instead."
    )
)
def select_columns(
    df: pd.DataFrame,
    *args: Any,
    invert: bool = False,
) -> pd.DataFrame:
    """Method-chainable selection of columns.

    It accepts a string, shell-like glob strings `(*string*)`,
    regex, slice, array-like object, or a list of the previous options.

    Selection on a MultiIndex on a level, or multiple levels,
    is possible with a dictionary.

    This method does not mutate the original DataFrame.

    Optional ability to invert selection of columns available as well.

    !!!note

        The preferred option when selecting columns or rows in a Pandas DataFrame
        is with `.loc` or `.iloc` methods.
        `select_columns` is primarily for convenience.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `jn.select` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> from numpy import nan
        >>> pd.set_option("display.max_columns", None)
        >>> pd.set_option("display.expand_frame_repr", False)
        >>> pd.set_option("max_colwidth", None)
        >>> data = {'name': ['Cheetah','Owl monkey','Mountain beaver',
        ...                  'Greater short-tailed shrew','Cow'],
        ...         'genus': ['Acinonyx', 'Aotus', 'Aplodontia', 'Blarina', 'Bos'],
        ...         'vore': ['carni', 'omni', 'herbi', 'omni', 'herbi'],
        ...         'order': ['Carnivora','Primates','Rodentia','Soricomorpha','Artiodactyla'],
        ...         'conservation': ['lc', nan, 'nt', 'lc', 'domesticated'],
        ...         'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
        ...         'sleep_rem': [nan, 1.8, 2.4, 2.3, 0.7],
        ...         'sleep_cycle': [nan, nan, nan, 0.133333333, 0.666666667],
        ...         'awake': [11.9, 7.0, 9.6, 9.1, 20.0],
        ...         'brainwt': [nan, 0.0155, nan, 0.00029, 0.423],
        ...         'bodywt': [50.0, 0.48, 1.35, 0.019, 600.0]}
        >>> df = pd.DataFrame(data)
        >>> df
                                 name       genus   vore         order  conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
        0                     Cheetah    Acinonyx  carni     Carnivora            lc         12.1        NaN          NaN   11.9      NaN   50.000
        1                  Owl monkey       Aotus   omni      Primates           NaN         17.0        1.8          NaN    7.0  0.01550    0.480
        2             Mountain beaver  Aplodontia  herbi      Rodentia            nt         14.4        2.4          NaN    9.6      NaN    1.350
        3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc         14.9        2.3     0.133333    9.1  0.00029    0.019
        4                         Cow         Bos  herbi  Artiodactyla  domesticated          4.0        0.7     0.666667   20.0  0.42300  600.000

        Explicit label selection:
        >>> df.select_columns('name', 'order')
                                 name         order
        0                     Cheetah     Carnivora
        1                  Owl monkey      Primates
        2             Mountain beaver      Rodentia
        3  Greater short-tailed shrew  Soricomorpha
        4                         Cow  Artiodactyla

        Selection via globbing:
        >>> df.select_columns("sleep*", "*wt")
           sleep_total  sleep_rem  sleep_cycle  brainwt   bodywt
        0         12.1        NaN          NaN      NaN   50.000
        1         17.0        1.8          NaN  0.01550    0.480
        2         14.4        2.4          NaN      NaN    1.350
        3         14.9        2.3     0.133333  0.00029    0.019
        4          4.0        0.7     0.666667  0.42300  600.000

        Selection via regex:
        >>> import re
        >>> df.select_columns(re.compile(r"o.+er"))
                  order  conservation
        0     Carnivora            lc
        1      Primates           NaN
        2      Rodentia            nt
        3  Soricomorpha            lc
        4  Artiodactyla  domesticated

        Selection via slicing:
        >>> df.select_columns(slice('name','order'), slice('sleep_total','sleep_cycle'))
                                 name       genus   vore         order  sleep_total  sleep_rem  sleep_cycle
        0                     Cheetah    Acinonyx  carni     Carnivora         12.1        NaN          NaN
        1                  Owl monkey       Aotus   omni      Primates         17.0        1.8          NaN
        2             Mountain beaver  Aplodontia  herbi      Rodentia         14.4        2.4          NaN
        3  Greater short-tailed shrew     Blarina   omni  Soricomorpha         14.9        2.3     0.133333
        4                         Cow         Bos  herbi  Artiodactyla          4.0        0.7     0.666667

        Selection via callable:
        >>> from pandas.api.types import is_numeric_dtype
        >>> df.select_columns(is_numeric_dtype)
           sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
        0         12.1        NaN          NaN   11.9      NaN   50.000
        1         17.0        1.8          NaN    7.0  0.01550    0.480
        2         14.4        2.4          NaN    9.6      NaN    1.350
        3         14.9        2.3     0.133333    9.1  0.00029    0.019
        4          4.0        0.7     0.666667   20.0  0.42300  600.000
        >>> df.select_columns(lambda f: f.isna().any())
           conservation  sleep_rem  sleep_cycle  brainwt
        0            lc        NaN          NaN      NaN
        1           NaN        1.8          NaN  0.01550
        2            nt        2.4          NaN      NaN
        3            lc        2.3     0.133333  0.00029
        4  domesticated        0.7     0.666667  0.42300

        Exclude columns with the `invert` parameter:
        >>> df.select_columns(is_numeric_dtype, invert=True)
                                 name       genus   vore         order  conservation
        0                     Cheetah    Acinonyx  carni     Carnivora            lc
        1                  Owl monkey       Aotus   omni      Primates           NaN
        2             Mountain beaver  Aplodontia  herbi      Rodentia            nt
        3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc
        4                         Cow         Bos  herbi  Artiodactyla  domesticated

        Exclude columns with the `DropLabel` class:
        >>> from janitor import DropLabel
        >>> df.select_columns(DropLabel(slice("name", "awake")), "conservation")
           brainwt   bodywt  conservation
        0      NaN   50.000            lc
        1  0.01550    0.480           NaN
        2      NaN    1.350            nt
        3  0.00029    0.019            lc
        4  0.42300  600.000  domesticated

        Selection on MultiIndex columns:
        >>> d = {'num_legs': [4, 4, 2, 2],
        ...      'num_wings': [0, 0, 2, 2],
        ...      'class': ['mammal', 'mammal', 'mammal', 'bird'],
        ...      'animal': ['cat', 'dog', 'bat', 'penguin'],
        ...      'locomotion': ['walks', 'walks', 'flies', 'walks']}
        >>> df = pd.DataFrame(data=d)
        >>> df = df.set_index(['class', 'animal', 'locomotion']).T
        >>> df
        class      mammal                bird
        animal        cat   dog   bat penguin
        locomotion  walks walks flies   walks
        num_legs        4     4     2       2
        num_wings       0     0     2       2

        Selection with a scalar:
        >>> df.select_columns('mammal')
        class      mammal
        animal        cat   dog   bat
        locomotion  walks walks flies
        num_legs        4     4     2
        num_wings       0     0     2

        Selection with a tuple:
        >>> df.select_columns(('mammal','bat'))
        class      mammal
        animal        bat
        locomotion  flies
        num_legs        2
        num_wings       2

        Selection within a level is possible with a dictionary,
        where the key is either a level name or number:
        >>> df.select_columns({'animal':'cat'})
        class      mammal
        animal        cat
        locomotion  walks
        num_legs        4
        num_wings       0
        >>> df.select_columns({1:["bat", "cat"]})
        class      mammal
        animal        bat   cat
        locomotion  flies walks
        num_legs        2     4
        num_wings       2     0

        Selection on multiple levels:
        >>> df.select_columns({"class":"mammal", "locomotion":"flies"})
        class      mammal
        animal        bat
        locomotion  flies
        num_legs        2
        num_wings       2

        Selection with a regex on a level:
        >>> df.select_columns({"animal":re.compile(".+t$")})
        class      mammal
        animal        cat   bat
        locomotion  walks flies
        num_legs        4     2
        num_wings       0     2

        Selection with a callable on a level:
        >>> df.select_columns({"animal":lambda f: f.str.endswith('t')})
        class      mammal
        animal        cat   bat
        locomotion  walks flies
        num_legs        4     2
        num_wings       0     2

    Args:
        df: A pandas DataFrame.
        *args: Valid inputs include: an exact column name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        invert: Whether or not to invert the selection.
            This will result in the selection
            of the complement of the columns provided.

    Returns:
        A pandas DataFrame with the specified columns selected.
    """  # noqa: E501

    return _select(df, columns=list(args), invert=invert)

select_rows(df, *args, invert=False)

Method-chainable selection of rows.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

The selection of rows can optionally be inverted as well.

New in version 0.24.0

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods, as they are generally performant. select_rows is primarily for convenience.

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}
>>> df = pd.DataFrame.from_dict(df, orient='index')
>>> df
      0  1
col1  1  2
foo   3  4
col2  5  6
>>> df.select_rows("col*")
      0  1
col1  1  2
col2  5  6

More examples can be found in the select_columns section.
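
The `invert` flag works the same way as for `select_columns`; a minimal sketch reusing the DataFrame above (the expected output is shown as a comment for illustration):

```python
# Complement of the glob selection above: only the "foo" row remains.
df.select_rows("col*", invert=True)
#      0  1
# foo  3  4
```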

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| `*args` | Any | Valid inputs include: an exact index name to look for, a shell-style glob string (e.g. `*_thing_*`), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels. | () |
| invert | bool | Whether or not to invert the selection. This will result in the selection of the complement of the rows provided. | False |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame with the specified rows selected. |

Source code in janitor/functions/select.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.select` instead."
    )
)
def select_rows(
    df: pd.DataFrame,
    *args: Any,
    invert: bool = False,
) -> pd.DataFrame:
    """Method-chainable selection of rows.

    It accepts a string, shell-like glob strings `(*string*)`,
    regex, slice, array-like object, or a list of the previous options.

    Selection on a MultiIndex on a level, or multiple levels,
    is possible with a dictionary.

    This method does not mutate the original DataFrame.

    Optional ability to invert selection of rows available as well.


    !!! info "New in version 0.24.0"

    !!!note

        The preferred option when selecting columns or rows in a Pandas DataFrame
        is with `.loc` or `.iloc` methods, as they are generally performant.
        `select_rows` is primarily for convenience.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `jn.select` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}
        >>> df = pd.DataFrame.from_dict(df, orient='index')
        >>> df
              0  1
        col1  1  2
        foo   3  4
        col2  5  6
        >>> df.select_rows("col*")
              0  1
        col1  1  2
        col2  5  6

    More examples can be found in the
    [`select_columns`][janitor.functions.select.select_columns] section.

    Args:
        df: A pandas DataFrame.
        *args: Valid inputs include: an exact index name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        invert: Whether or not to invert the selection.
            This will result in the selection
            of the complement of the rows provided.

    Returns:
        A pandas DataFrame with the specified rows selected.
    """  # noqa: E501
    return _select(df, rows=list(args), invert=invert)

transform_column(df, column_name, function, dest_column_name=None, elementwise=True)

Transform the given column using the provided function.

Meant to be the method-chaining equivalent of:

df[dest_column_name] = df[column_name].apply(function)

Functions can be applied in one of two ways:

  • Element-wise (default; elementwise=True). Then, the individual column elements will be passed in as the first argument of function.
  • Column-wise (elementwise=False). Then, function is expected to take in a pandas Series and return a sequence that is of identical length to the original.

If dest_column_name is provided, then the transformation result is stored in that column. Otherwise, the transformed result is stored under the name of the original column.

This method does not mutate the original DataFrame.

Examples:

Transform a column in-place with an element-wise function.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [2, 3, 4],
...     "b": ["area", "pyjanitor", "grapefruit"],
... })
>>> df
   a           b
0  2        area
1  3   pyjanitor
2  4  grapefruit
>>> df.transform_column(
...     column_name="a",
...     function=lambda x: x**2 - 1,
... )
    a           b
0   3        area
1   8   pyjanitor
2  15  grapefruit

Examples:

Transform a column in-place with a column-wise function.

>>> df.transform_column(
...     column_name="b",
...     function=lambda srs: srs.str[:5],
...     elementwise=False,
... )
   a      b
0  2   area
1  3  pyjan
2  4  grape
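
To keep the original column and store the result separately, pass `dest_column_name`; a minimal sketch reusing the DataFrame above (the new column name is illustrative):

```python
# Sketch: keep column "a" intact and write the transformed values
# to a new column instead of overwriting the original.
df.transform_column(
    column_name="a",
    function=lambda x: x ** 2,
    dest_column_name="a_squared",
)
```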

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| column_name | Hashable | The column to transform. | required |
| function | Callable | A function to apply on the column. | required |
| dest_column_name | Optional[str] | The column name to store the transformation result in. Defaults to None, which will result in the original column name being overwritten. If a name is provided here, then a new column with the transformed values will be created. | None |
| elementwise | bool | Whether to apply the function elementwise or not. If elementwise is True, then the function's first argument should be the data type of each datum in the column of data, and should return a transformed datum. If elementwise is False, then the function should expect a pandas Series passed into it, and return a pandas Series. | True |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame with a transformed column. |

Source code in janitor/functions/transform_columns.py
@pf.register_dataframe_method
@deprecated_alias(col_name="column_name", dest_col_name="dest_column_name")
def transform_column(
    df: pd.DataFrame,
    column_name: Hashable,
    function: Callable,
    dest_column_name: Optional[str] = None,
    elementwise: bool = True,
) -> pd.DataFrame:
    """Transform the given column using the provided function.

    Meant to be the method-chaining equivalent of:
    ```python
    df[dest_column_name] = df[column_name].apply(function)
    ```

    Functions can be applied in one of two ways:

    - **Element-wise** (default; `elementwise=True`). Then, the individual
    column elements will be passed in as the first argument of `function`.
    - **Column-wise** (`elementwise=False`). Then, `function` is expected to
    take in a pandas Series and return a sequence that is of identical length
    to the original.

    If `dest_column_name` is provided, then the transformation result is stored
    in that column. Otherwise, the transformed result is stored under the name
    of the original column.

    This method does not mutate the original DataFrame.

    Examples:
        Transform a column in-place with an element-wise function.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": [2, 3, 4],
        ...     "b": ["area", "pyjanitor", "grapefruit"],
        ... })
        >>> df
           a           b
        0  2        area
        1  3   pyjanitor
        2  4  grapefruit
        >>> df.transform_column(
        ...     column_name="a",
        ...     function=lambda x: x**2 - 1,
        ... )
            a           b
        0   3        area
        1   8   pyjanitor
        2  15  grapefruit

    Examples:
        Transform a column in-place with a column-wise function.

        >>> df.transform_column(
        ...     column_name="b",
        ...     function=lambda srs: srs.str[:5],
        ...     elementwise=False,
        ... )
           a      b
        0  2   area
        1  3  pyjan
        2  4  grape

    Args:
        df: A pandas DataFrame.
        column_name: The column to transform.
        function: A function to apply on the column.
        dest_column_name: The column name to store the transformation result
            in. Defaults to None, which will result in the original column
            name being overwritten. If a name is provided here, then a new
            column with the transformed values will be created.
        elementwise: Whether to apply the function elementwise or not.
            If `elementwise` is True, then the function's first argument
            should be the data type of each datum in the column of data,
            and should return a transformed datum.
            If `elementwise` is False, then the function should expect
            a pandas Series passed into it, and return a pandas Series.

    Returns:
        A pandas DataFrame with a transformed column.
    """
    check_column(df, column_name)

    if dest_column_name is None:
        dest_column_name = column_name
    elif dest_column_name != column_name:
        # If `dest_column_name` is provided and equals `column_name`, then we
        # assume that the user's intent is to perform an in-place
        # transformation (Same behaviour as when `dest_column_name` = None).
        # Otherwise we throw an error if `dest_column_name` already exists in
        # df.
        check_column(df, dest_column_name, present=False)

    result = _get_transform_column_result(
        df[column_name],
        function,
        elementwise,
    )

    return df.assign(**{dest_column_name: result})

truncate_datetime_dataframe(df, datepart)

Truncate times down to a user-specified precision of year, month, day, hour, minute, or second.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["xxxx", "yyyy", "zzzz"],
...     "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
... })
>>> df
    foo                  dt
0  xxxx 2020-03-11 00:00:00
1  yyyy 2020-03-11 15:00:00
2  zzzz 2020-03-12 06:00:00
>>> df.truncate_datetime_dataframe("day")
    foo         dt
0  xxxx 2020-03-11
1  yyyy 2020-03-11
2  zzzz 2020-03-12
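
The truncation is implemented as a NumPy datetime64 cast to a coarser unit (see the source below); a standalone sketch of that idea:

```python
import numpy as np
import pandas as pd

dt = pd.date_range("2020-03-11", periods=3, freq="15H")
# Casting to a coarser datetime64 unit drops the finer components;
# "datetime64[D]" truncates each timestamp down to its day.
np.array(dt.values, dtype="datetime64[D]")
# array(['2020-03-11', '2020-03-11', '2020-03-12'], dtype='datetime64[D]')
```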

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The pandas DataFrame on which to truncate datetime. | required |
| datepart | str | Truncation precision, YEAR, MONTH, DAY, HOUR, MINUTE, SECOND. (String is automagically capitalized) | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If an invalid datepart precision is passed in. |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame with all valid datetimes truncated down to the specified precision. |

Source code in janitor/functions/truncate_datetime.py
@pf.register_dataframe_method
def truncate_datetime_dataframe(
    df: pd.DataFrame,
    datepart: str,
) -> pd.DataFrame:
    """Truncate times down to a user-specified precision of
    year, month, day, hour, minute, or second.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "foo": ["xxxx", "yyyy", "zzzz"],
        ...     "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
        ... })
        >>> df
            foo                  dt
        0  xxxx 2020-03-11 00:00:00
        1  yyyy 2020-03-11 15:00:00
        2  zzzz 2020-03-12 06:00:00
        >>> df.truncate_datetime_dataframe("day")
            foo         dt
        0  xxxx 2020-03-11
        1  yyyy 2020-03-11
        2  zzzz 2020-03-12

    Args:
        df: The pandas DataFrame on which to truncate datetime.
        datepart: Truncation precision, YEAR, MONTH, DAY,
            HOUR, MINUTE, SECOND. (String is automagically
            capitalized)

    Raises:
        ValueError: If an invalid `datepart` precision is passed in.

    Returns:
        A pandas DataFrame with all valid datetimes truncated down
            to the specified precision.
    """
    # idea from Stack Overflow
    # https://stackoverflow.com/a/28783971/7175713
    # https://numpy.org/doc/stable/reference/arrays.datetime.html
    ACCEPTABLE_DATEPARTS = {
        "YEAR": "datetime64[Y]",
        "MONTH": "datetime64[M]",
        "DAY": "datetime64[D]",
        "HOUR": "datetime64[h]",
        "MINUTE": "datetime64[m]",
        "SECOND": "datetime64[s]",
    }
    datepart = datepart.upper()
    if datepart not in ACCEPTABLE_DATEPARTS:
        raise ValueError(
            "Received an invalid `datepart` precision. "
            f"Please enter any one of {ACCEPTABLE_DATEPARTS}."
        )

    dictionary = {}

    for label, series in df.items():
        if is_datetime64_any_dtype(series):
            dtype = ACCEPTABLE_DATEPARTS[datepart]
            # TODO: add branch for pyarrow arrays
            series = np.array(series._values, dtype=dtype)
        dictionary[label] = series

    return pd.DataFrame(dictionary)

unionize_dataframe_categories(*dataframes, column_names=None)

Given a group of dataframes which contain some categorical columns, for each categorical column present, find all the possible categories across all the dataframes which have that column. Update each dataframe's corresponding column with a new categorical object that contains the original data but has labels for all the possible categories from all dataframes. This is useful when concatenating a list of dataframes which all have the same categorical columns into one dataframe.

If, for a given categorical column, not every input dataframe has at least one instance of every possible category, Pandas will change the output dtype of that column from category to object, losing out on the dramatic speed gains you get from the former format.

Examples:

Usage example for concatenation of categorical column-containing dataframes:

Instead of:

concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)

which in your case has resulted in category -> object conversion, use:

unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
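
Below is a small, self-contained sketch of the same pattern; the column name and data are illustrative, and it assumes `unionize_dataframe_categories` is importable from the top-level `janitor` namespace like the other general functions:

```python
import pandas as pd
from janitor import unionize_dataframe_categories

df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "pear"])})
df2 = pd.DataFrame({"fruit": pd.Categorical(["banana"])})

# Plain concatenation demotes "fruit" to object, because the two
# categorical dtypes do not share the same categories.
plain = pd.concat([df1, df2], ignore_index=True)

# Unionizing the categories first keeps the concatenated column categorical.
unioned = pd.concat(
    unionize_dataframe_categories(df1, df2), ignore_index=True
)
```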

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `*dataframes` | Any | The dataframes you wish to unionize the categorical objects for. | () |
| column_names | Optional[Iterable[CategoricalDtype]] | If supplied, only unionize this subset of columns. | None |

Raises:

| Type | Description |
| --- | --- |
| TypeError | If any of the inputs are not pandas DataFrames. |

Returns:

| Type | Description |
| --- | --- |
| List[DataFrame] | A list of the category-unioned dataframes in the same order they were provided. |

Source code in janitor/functions/utils.py
def unionize_dataframe_categories(
    *dataframes: Any,
    column_names: Optional[Iterable[pd.CategoricalDtype]] = None,
) -> List[pd.DataFrame]:
    """
    Given a group of dataframes which contain some categorical columns, for
    each categorical column present, find all the possible categories across
    all the dataframes which have that column.
    Update each dataframe's corresponding column with a new categorical object
    that contains the original data
    but has labels for all the possible categories from all dataframes.
    This is useful when concatenating a list of dataframes which all have the
    same categorical columns into one dataframe.

    If, for a given categorical column, not every input dataframe has at
    least one instance of every possible category,
    Pandas will change the output dtype of that column from `category` to
    `object`, losing out on dramatic speed gains you get from the former
    format.

    Examples:
        Usage example for concatenation of categorical column-containing
        dataframes:

        Instead of:

        ```python
        concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)
        ```

        which in your case has resulted in `category` -> `object` conversion,
        use:

        ```python
        unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
        concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
        ```

    Args:
        *dataframes: The dataframes you wish to unionize the categorical
            objects for.
        column_names: If supplied, only unionize this subset of columns.

    Raises:
        TypeError: If any of the inputs are not pandas DataFrames.

    Returns:
        A list of the category-unioned dataframes in the same order they
            were provided.
    """

    if any(not isinstance(df, pd.DataFrame) for df in dataframes):
        raise TypeError("Inputs must all be dataframes.")

    if column_names is None:
        # Find all columns across all dataframes that are categorical

        column_names = set()

        for dataframe in dataframes:
            column_names = column_names.union(
                [
                    column_name
                    for column_name in dataframe.columns
                    if isinstance(
                        dataframe[column_name].dtype, pd.CategoricalDtype
                    )
                ]
            )

    else:
        column_names = [column_names]
    # For each categorical column, find all possible values across the DFs

    category_unions = {
        column_name: union_categoricals(
            [df[column_name] for df in dataframes if column_name in df.columns]
        )
        for column_name in column_names
    }

    # Make a shallow copy of all DFs and modify the categorical columns
    # such that they can encode the union of all possible categories for each.

    refactored_dfs = []

    for df in dataframes:
        df = df.copy(deep=False)

        for column_name, categorical in category_unions.items():
            if column_name in df.columns:
                df[column_name] = pd.Categorical(
                    df[column_name], categories=categorical.categories
                )

        refactored_dfs.append(df)

    return refactored_dfs

add_columns

add_column(df, column_name, value, fill_remaining=False)

Add a column to the dataframe.

Intended to be the method-chaining alternative to:

df[column_name] = value

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

Examples:

Add a column of constant values to the dataframe.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=1)
   a  b  c
0  0  a  1
1  1  b  1
2  2  c  1

Add a column of different values to the dataframe.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=list("efg"))
   a  b  c
0  0  a  e
1  1  b  f
2  2  c  g

Add a column using an iterator.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=range(4, 7))
   a  b  c
0  0  a  4
1  1  b  5
2  2  c  6
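
When `fill_remaining=True`, a shorter list or tuple is recycled R-style until it matches the number of rows; a minimal sketch:

```python
import pandas as pd
import janitor

df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
# Two values are recycled across the three rows -> [5, 6, 5].
df.add_column(column_name="c", value=[5, 6], fill_remaining=True)
```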

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| column_name | str | Name of the new column. Should be a string, in order for the column name to be compatible with the Feather binary format (this is a useful thing to have). | required |
| value | Union[List[Any], Tuple[Any], Any] | Either a single value, or a list/tuple of values. | required |
| fill_remaining | bool | If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. | False |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If attempting to add a column that already exists. |
| ValueError | If value has more elements than the number of rows in the DataFrame. |
| ValueError | If attempting to add an iterable of values with a length not equal to the number of DataFrame rows. |
| ValueError | If value has length of 0. |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame with an added column. |

Source code in janitor/functions/add_columns.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.assign` instead."
    )
)
@deprecated_alias(col_name="column_name")
def add_column(
    df: pd.DataFrame,
    column_name: str,
    value: Union[List[Any], Tuple[Any], Any],
    fill_remaining: bool = False,
) -> pd.DataFrame:
    """Add a column to the dataframe.

    Intended to be the method-chaining alternative to:

    ```python
    df[column_name] = value
    ```

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.assign` instead.

    Examples:
        Add a column of constant values to the dataframe.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
        >>> df.add_column(column_name="c", value=1)
           a  b  c
        0  0  a  1
        1  1  b  1
        2  2  c  1

        Add a column of different values to the dataframe.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
        >>> df.add_column(column_name="c", value=list("efg"))
           a  b  c
        0  0  a  e
        1  1  b  f
        2  2  c  g

        Add a column using an iterator.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
        >>> df.add_column(column_name="c", value=range(4, 7))
           a  b  c
        0  0  a  4
        1  1  b  5
        2  2  c  6

    Args:
        df: A pandas DataFrame.
        column_name: Name of the new column. Should be a string, in order
            for the column name to be compatible with the Feather binary
            format (this is a useful thing to have).
        value: Either a single value, or a list/tuple of values.
        fill_remaining: If value is a tuple or list that is smaller than
            the number of rows in the DataFrame, repeat the list or tuple
            (R-style) to the end of the DataFrame.

    Raises:
        ValueError: If attempting to add a column that already exists.
        ValueError: If `value` has more elements than the number of
            rows in the DataFrame.
        ValueError: If attempting to add an iterable of values with
            a length not equal to the number of DataFrame rows.
        ValueError: If `value` has length of `0`.

    Returns:
        A pandas DataFrame with an added column.
    """
    check("column_name", column_name, [str])

    if column_name in df.columns:
        raise ValueError(
            f"Attempted to add column that already exists: " f"{column_name}."
        )

    nrows = len(df)

    if hasattr(value, "__len__") and not isinstance(
        value, (str, bytes, bytearray)
    ):
        len_value = len(value)

        # if `value` is a list, ndarray, etc.
        if len_value > nrows:
            raise ValueError(
                "`value` has more elements than number of rows "
                f"in your `DataFrame`. vals: {len_value}, "
                f"df: {nrows}"
            )
        if len_value != nrows and not fill_remaining:
            raise ValueError(
                "Attempted to add iterable of values with length"
                " not equal to number of DataFrame rows"
            )
        if not len_value:
            raise ValueError(
                "`value` has to be an iterable of minimum length 1"
            )

    elif fill_remaining:
        # relevant if a scalar val was passed, yet fill_remaining == True
        len_value = 1
        value = [value]

    df = df.copy()
    if fill_remaining:
        times_to_loop = int(np.ceil(nrows / len_value))
        fill_values = list(value) * times_to_loop
        df[column_name] = fill_values[:nrows]
    else:
        df[column_name] = value

    return df

add_columns(df, fill_remaining=False, **kwargs)

Add multiple columns to the dataframe.

This method does not mutate the original DataFrame.

Method to augment add_column with ability to add multiple columns in one go. This replaces the need for multiple add_column calls.

Usage is through supplying kwargs where the key is the col name and the values correspond to the values of the new DataFrame column.

Values passed can be scalar or iterable (list, ndarray, etc.)

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

Examples:

Inserting two more columns into a dataframe.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_columns(x=4, y=list("def"))
   a  b  x  y
0  0  a  4  d
1  1  b  4  e
2  2  c  4  f
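
Since `add_columns` simply loops over `add_column`, the `fill_remaining` flag applies to every keyword value; a minimal sketch:

```python
import pandas as pd
import janitor

df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
# Each keyword value shorter than the frame is recycled to its length.
df.add_columns(x=[1, 2], y=["u"], fill_remaining=True)
```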

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| fill_remaining | bool | If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. (Passed to add_column) | False |
| `**kwargs` | Any | Column, value pairs which are looped through in add_column calls. | {} |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame with added columns. |

Source code in janitor/functions/add_columns.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.assign` instead."
    )
)
def add_columns(
    df: pd.DataFrame,
    fill_remaining: bool = False,
    **kwargs: Any,
) -> pd.DataFrame:
    """Add multiple columns to the dataframe.

    This method does not mutate the original DataFrame.

    Method to augment
    [`add_column`][janitor.functions.add_columns.add_column]
    with ability to add multiple columns in
    one go. This replaces the need for multiple
    [`add_column`][janitor.functions.add_columns.add_column] calls.

    Usage is through supplying kwargs where the key is the col name and the
    values correspond to the values of the new DataFrame column.

    Values passed can be scalar or iterable (list, ndarray, etc.)

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.assign` instead.

    Examples:
        Inserting two more columns into a dataframe.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
        >>> df.add_columns(x=4, y=list("def"))
           a  b  x  y
        0  0  a  4  d
        1  1  b  4  e
        2  2  c  4  f

    Args:
        df: A pandas DataFrame.
        fill_remaining: If value is a tuple or list that is smaller than
            the number of rows in the DataFrame, repeat the list or tuple
            (R-style) to the end of the DataFrame. (Passed to
            [`add_column`][janitor.functions.add_columns.add_column])
        **kwargs: Column, value pairs which are looped through in
            [`add_column`][janitor.functions.add_columns.add_column] calls.

    Returns:
        A pandas DataFrame with added columns.
    """
    # Note: error checking can pretty much be handled in `add_column`

    for col_name, values in kwargs.items():
        df = df.add_column(col_name, values, fill_remaining=fill_remaining)

    return df

also

Implementation source for chainable function also.

also(df, func, *args, **kwargs)

Run a function with side effects.

This function allows you to run an arbitrary function in the pyjanitor method chain. Doing so will let you do things like save the dataframe to disk midway while continuing to modify the dataframe afterwards.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = (
...     pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
...     .query("a > 1")
...     .also(lambda df: print(f"DataFrame shape is: {df.shape}"))
...     .rename_column(old_column_name="a", new_column_name="a_new")
...     .also(lambda df: df.to_csv("midpoint.csv"))
...     .also(
...         lambda df: print(f"Columns: {df.columns}")
...     )
... )
DataFrame shape is: (2, 2)
Columns: Index(['a_new', 'b'], dtype='object')
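
Because `also` hands a copy of the DataFrame to `func` and discards any return value, it is convenient for logging or checkpointing mid-chain; a minimal sketch (the helper below is illustrative):

```python
import pandas as pd
import janitor

def log_shape(frame: pd.DataFrame, label: str) -> None:
    # Side effect only; any return value is ignored by `also`.
    print(f"{label}: {frame.shape}")

df = (
    pd.DataFrame({"a": [1, 2, 3]})
    .also(log_shape, "before filter")
    .query("a > 1")
    .also(log_shape, "after filter")
)
```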

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| func | Callable | A function you would like to run in the method chain. It should take one DataFrame object as a parameter and have no return. If there is a return, it will be ignored. | required |
| `*args` | Any | Optional arguments for func. | () |
| `**kwargs` | Any | Optional keyword arguments for func. | {} |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The input pandas DataFrame, unmodified. |

Source code in janitor/functions/also.py
@pf.register_dataframe_method
def also(
    df: pd.DataFrame, func: Callable, *args: Any, **kwargs: Any
) -> pd.DataFrame:
    """Run a function with side effects.

    This function allows you to run an arbitrary function
    in the `pyjanitor` method chain.
    Doing so will let you do things like save the dataframe to disk midway
    while continuing to modify the dataframe afterwards.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = (
        ...     pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
        ...     .query("a > 1")
        ...     .also(lambda df: print(f"DataFrame shape is: {df.shape}"))
        ...     .rename_column(old_column_name="a", new_column_name="a_new")
        ...     .also(lambda df: df.to_csv("midpoint.csv"))
        ...     .also(
        ...         lambda df: print(f"Columns: {df.columns}")
        ...     )
        ... )
        DataFrame shape is: (2, 2)
        Columns: Index(['a_new', 'b'], dtype='object')

    Args:
        df: A pandas DataFrame.
        func: A function you would like to run in the method chain.
            It should take one DataFrame object as a parameter and have no return.
            If there is a return, it will be ignored.
        *args: Optional arguments for `func`.
        **kwargs: Optional keyword arguments for `func`.

    Returns:
        The input pandas DataFrame, unmodified.
    """  # noqa: E501
    func(df.copy(), *args, **kwargs)
    return df

bin_numeric

Implementation source for bin_numeric.

bin_numeric(df, from_column_name, to_column_name, bins=5, **kwargs)

Generate a new column that labels bins for a specified numeric column.

This method does not mutate the original DataFrame.

A wrapper around the pandas cut() function to bin data of one column, generating a new column with the results.

Examples:

Binning a numeric column with specific bin edges.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
>>> df.bin_numeric(
...     from_column_name="a", to_column_name="a_binned",
...     bins=[0, 5, 11, 15],
... )
    a  a_binned
0   3    (0, 5]
1   6   (5, 11]
2   9   (5, 11]
3  12  (11, 15]
4  15  (11, 15]
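
Any extra keyword arguments are forwarded to `pd.cut`, so bin labels, for instance, can be attached directly; a minimal sketch (the label names are illustrative):

```python
import pandas as pd
import janitor

df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
# `labels` is passed straight through to pd.cut.
df.bin_numeric(
    from_column_name="a",
    to_column_name="a_binned",
    bins=[0, 5, 11, 15],
    labels=["low", "mid", "high"],
)
```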

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| from_column_name | str | The column whose data you want binned. | required |
| to_column_name | str | The new column to be created with the binned data. | required |
| bins | Optional[Union[int, ScalarSequence, IntervalIndex]] | The binning strategy to be utilized. Read the pd.cut documentation for more details. | 5 |
| `**kwargs` | Any | Additional kwargs to pass to pd.cut, except retbins. | {} |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If retbins is passed in as a kwarg. |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame. |

Source code in janitor/functions/bin_numeric.py
@pf.register_dataframe_method
@deprecated_alias(
    from_column="from_column_name",
    to_column="to_column_name",
    num_bins="bins",
)
def bin_numeric(
    df: pd.DataFrame,
    from_column_name: str,
    to_column_name: str,
    bins: Optional[Union[int, ScalarSequence, pd.IntervalIndex]] = 5,
    **kwargs: Any,
) -> pd.DataFrame:
    """Generate a new column that labels bins for a specified numeric column.

    This method does not mutate the original DataFrame.

    A wrapper around the pandas [`cut()`][pd_cut_docs] function to bin data of
    one column, generating a new column with the results.

    [pd_cut_docs]: https://pandas.pydata.org/docs/reference/api/pandas.cut.html

    Examples:
        Binning a numeric column with specific bin edges.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
        >>> df.bin_numeric(
        ...     from_column_name="a", to_column_name="a_binned",
        ...     bins=[0, 5, 11, 15],
        ... )
            a  a_binned
        0   3    (0, 5]
        1   6   (5, 11]
        2   9   (5, 11]
        3  12  (11, 15]
        4  15  (11, 15]

    Args:
        df: A pandas DataFrame.
        from_column_name: The column whose data you want binned.
        to_column_name: The new column to be created with the binned data.
        bins: The binning strategy to be utilized. Read the `pd.cut`
            documentation for more details.
        **kwargs: Additional kwargs to pass to `pd.cut`, except `retbins`.

    Raises:
        ValueError: If `retbins` is passed in as a kwarg.

    Returns:
        A pandas DataFrame.
    """
    if "retbins" in kwargs:
        raise ValueError("`retbins` is not an acceptable keyword argument.")

    check("from_column_name", from_column_name, [str])
    check("to_column_name", to_column_name, [str])
    check_column(df, from_column_name)

    df = df.assign(
        **{
            to_column_name: pd.cut(df[from_column_name], bins=bins, **kwargs),
        }
    )

    return df

case_when

Implementation source for case_when.

case_when(df, *args, default=None, column_name)

Create a column based on a condition or multiple conditions.

Similar to SQL and dplyr's case_when with inspiration from pydatatable if_else function.

If your scenario requires direct replacement of values, pandas' replace method or map method should be better suited and more efficient; if the conditions check if a value is within a range of values, pandas' cut or qcut should be more efficient; np.where/np.select are also performant options.

This function relies on pd.Series.mask method.

When multiple conditions are satisfied, the first one is used.

The variable *args parameter takes arguments of the form: condition0, value0, condition1, value1, ..., default. If condition0 evaluates to True, then assign value0 to column_name, if condition1 evaluates to True, then assign value1 to column_name, and so on. If none of the conditions evaluate to True, assign default to column_name.

This function can be likened to SQL's case_when:

CASE WHEN condition0 THEN value0
    WHEN condition1 THEN value1
    --- more conditions
    ELSE default
    END AS column_name

compared to python's if-elif-else:

if condition0:
    value0
elif condition1:
    value1
# more elifs
else:
    default

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "a": [0, 0, 1, 2, "hi"],
...         "b": [0, 3, 4, 5, "bye"],
...         "c": [6, 7, 8, 9, "wait"],
...     }
... )
>>> df
    a    b     c
0   0    0     6
1   0    3     7
2   1    4     8
3   2    5     9
4  hi  bye  wait
>>> df.case_when(
...     ((df.a == 0) & (df.b != 0)) | (df.c == "wait"), df.a,
...     (df.b == 0) & (df.a == 0), "x",
...     default = df.c,
...     column_name = "value",
... )
    a    b     c value
0   0    0     6     x
1   0    3     7     0
2   1    4     8     8
3   2    5     9     9
4  hi  bye  wait    hi
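
Conditions can also be supplied as strings (evaluated with `df.eval`) or callables, as described in the arguments below; a minimal sketch on a purely numeric frame:

```python
import pandas as pd
import janitor

df = pd.DataFrame({"a": [-2, 0, 3]})
df.case_when(
    "a < 0", "negative",          # string condition, evaluated via df.eval
    lambda f: f.a == 0, "zero",   # callable condition -> boolean Series
    default="positive",
    column_name="sign",
)
```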

Version Changed

  • 0.24.0
    • Added default parameter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| `*args` | Any | Variable argument of conditions and expected values. Takes the form condition0, value0, condition1, value1, ... . condition can be a 1-D boolean array, a callable, or a string. If condition is a callable, it should evaluate to a 1-D boolean array. The array should have the same length as the DataFrame. If it is a string, it is computed on the dataframe, via df.eval, and should return a 1-D boolean array. result can be a scalar, a 1-D array, or a callable. If result is a callable, it should evaluate to a 1-D array. For a 1-D array, it should have the same length as the DataFrame. | () |
| default | Any | This is the element inserted in the output when all conditions evaluate to False. Can be scalar, 1-D array or callable. If callable, it should evaluate to a 1-D array. The 1-D array should be the same length as the DataFrame. | None |
| column_name | str | Name of column to assign results to. A new column is created if it does not already exist in the DataFrame. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If condition/value fails to evaluate. |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame. |

Source code in janitor/functions/case_when.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.Series.case_when` instead."
    )
)
def case_when(
    df: pd.DataFrame, *args: Any, default: Any = None, column_name: str
) -> pd.DataFrame:
    """Create a column based on a condition or multiple conditions.

    Similar to SQL and dplyr's case_when
    with inspiration from `pydatatable` if_else function.

    If your scenario requires direct replacement of values,
    pandas' `replace` method or `map` method should be better
    suited and more efficient; if the conditions check
    if a value is within a range of values, pandas' `cut` or `qcut`
    should be more efficient; `np.where/np.select` are also
    performant options.

    This function relies on `pd.Series.mask` method.

    When multiple conditions are satisfied, the first one is used.

    The variable `*args` parameter takes arguments of the form:
    `condition0`, `value0`, `condition1`, `value1`, ..., `default`.
    If `condition0` evaluates to `True`, then assign `value0` to
    `column_name`, if `condition1` evaluates to `True`, then
    assign `value1` to `column_name`, and so on. If none of the
    conditions evaluate to `True`, assign `default` to
    `column_name`.

    This function can be likened to SQL's `case_when`:

    ```sql
    CASE WHEN condition0 THEN value0
        WHEN condition1 THEN value1
        --- more conditions
        ELSE default
        END AS column_name
    ```

    compared to python's `if-elif-else`:

    ```python
    if condition0:
        value0
    elif condition1:
        value1
    # more elifs
    else:
        default
    ```

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "a": [0, 0, 1, 2, "hi"],
        ...         "b": [0, 3, 4, 5, "bye"],
        ...         "c": [6, 7, 8, 9, "wait"],
        ...     }
        ... )
        >>> df
            a    b     c
        0   0    0     6
        1   0    3     7
        2   1    4     8
        3   2    5     9
        4  hi  bye  wait
        >>> df.case_when(
        ...     ((df.a == 0) & (df.b != 0)) | (df.c == "wait"), df.a,
        ...     (df.b == 0) & (df.a == 0), "x",
        ...     default = df.c,
        ...     column_name = "value",
        ... )
            a    b     c value
        0   0    0     6     x
        1   0    3     7     0
        2   1    4     8     8
        3   2    5     9     9
        4  hi  bye  wait    hi

    !!! abstract "Version Changed"

        - 0.24.0
            - Added `default` parameter.

    Args:
        df: A pandas DataFrame.
        *args: Variable argument of conditions and expected values.
            Takes the form
            `condition0`, `value0`, `condition1`, `value1`, ... .
            `condition` can be a 1-D boolean array, a callable, or a string.
            If `condition` is a callable, it should evaluate
            to a 1-D boolean array. The array should have the same length
            as the DataFrame. If it is a string, it is computed on the dataframe,
            via `df.eval`, and should return a 1-D boolean array.
            `result` can be a scalar, a 1-D array, or a callable.
            If `result` is a callable, it should evaluate to a 1-D array.
            For a 1-D array, it should have the same length as the DataFrame.
        default: This is the element inserted in the output
            when all conditions evaluate to False.
            Can be scalar, 1-D array or callable.
            If callable, it should evaluate to a 1-D array.
            The 1-D array should be the same length as the DataFrame.
        column_name: Name of column to assign results to. A new column
            is created if it does not already exist in the DataFrame.

    Raises:
        ValueError: If condition/value fails to evaluate.

    Returns:
        A pandas DataFrame.
    """  # noqa: E501
    # Preliminary checks on the case_when function.
    # The bare minimum checks are done; the remaining checks
    # are done within `pd.Series.mask`.
    check("column_name", column_name, [str])
    len_args = len(args)
    if len_args < 2:
        raise ValueError(
            "At least two arguments are required for the `args` parameter"
        )

    if len_args % 2:
        if default is None:
            warnings.warn(
                "The last argument in the variable arguments "
                "has been assigned as the default. "
                "Note however that this will be deprecated "
                "in a future release; use an even number "
                "of boolean conditions and values, "
                "and pass the default argument to the `default` "
                "parameter instead.",
                DeprecationWarning,
                stacklevel=find_stack_level(),
            )
            *args, default = args
        else:
            raise ValueError(
                "The number of conditions and values do not match. "
                f"There are {len_args - len_args//2} conditions "
                f"and {len_args//2} values."
            )

    booleans = []
    replacements = []

    for index, value in enumerate(args):
        if index % 2:
            if callable(value):
                value = apply_if_callable(value, df)
            replacements.append(value)
        else:
            if callable(value):
                value = apply_if_callable(value, df)
            elif isinstance(value, str):
                value = df.eval(value)
            booleans.append(value)

    if callable(default):
        default = apply_if_callable(default, df)
    if is_scalar(default):
        default = pd.Series([default]).repeat(len(df))
    if not hasattr(default, "shape"):
        default = pd.Series([*default])
    if isinstance(default, pd.Index):
        arr_ndim = default.nlevels
    else:
        arr_ndim = default.ndim
    if arr_ndim != 1:
        raise ValueError(
            "The argument for the `default` parameter "
            "should either be a 1-D array, a scalar, "
            "or a callable that can evaluate to a 1-D array."
        )
    if not isinstance(default, pd.Series):
        default = pd.Series(default)
    default.index = df.index
    # actual computation
    # ensures value assignment is on a first come basis
    booleans = booleans[::-1]
    replacements = replacements[::-1]
    for index, (condition, value) in enumerate(zip(booleans, replacements)):
        try:
            default = default.mask(condition, value)
        # error `feedoff` idea from SO
        # https://stackoverflow.com/a/46091127/7175713
        except Exception as error:
            raise ValueError(
                f"condition{index} and value{index} failed to evaluate. "
                f"Original error message: {error}"
            ) from error

    return df.assign(**{column_name: default})

change_index_dtype

Implementation of the change_index_dtype function.

change_index_dtype(df, dtype, axis='index')

Cast an index to a specified dtype.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> rng = np.random.default_rng(seed=0)
>>> np.random.seed(0)
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
...             'foo', 'foo', 'qux', 'qux'],
...              [1.0, 2.0, 1.0, 2.0,
...               1.0, 2.0, 1.0, 2.0]]))
>>> idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])
>>> df
                     A         B
first second
bar   1.0     1.764052  0.400157
      2.0     0.978738  2.240893
baz   1.0     1.867558 -0.977278
      2.0     0.950088 -0.151357
foo   1.0    -0.103219  0.410599
      2.0     0.144044  1.454274
qux   1.0     0.761038  0.121675
      2.0     0.443863  0.333674
>>> outcome=df.change_index_dtype(dtype=str)
>>> outcome
                     A         B
first second
bar   1.0     1.764052  0.400157
      2.0     0.978738  2.240893
baz   1.0     1.867558 -0.977278
      2.0     0.950088 -0.151357
foo   1.0    -0.103219  0.410599
      2.0     0.144044  1.454274
qux   1.0     0.761038  0.121675
      2.0     0.443863  0.333674
>>> outcome.index.dtypes
first     object
second    object
dtype: object
>>> outcome=df.change_index_dtype(dtype={'second':int})
>>> outcome
                     A         B
first second
bar   1       1.764052  0.400157
      2       0.978738  2.240893
baz   1       1.867558 -0.977278
      2       0.950088 -0.151357
foo   1      -0.103219  0.410599
      2       0.144044  1.454274
qux   1       0.761038  0.121675
      2       0.443863  0.333674
>>> outcome.index.dtypes
first     object
second     int64
dtype: object
>>> outcome=df.change_index_dtype(dtype={0:'category',1:int})
>>> outcome
                     A         B
first second
bar   1       1.764052  0.400157
      2       0.978738  2.240893
baz   1       1.867558 -0.977278
      2       0.950088 -0.151357
foo   1      -0.103219  0.410599
      2       0.144044  1.454274
qux   1       0.761038  0.121675
      2       0.443863  0.333674
>>> outcome.index.dtypes
first     category
second       int64
dtype: object
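
The same casting can be applied to the column labels via the `axis` parameter; a minimal sketch:

```python
import pandas as pd
import janitor

df = pd.DataFrame([[1, 2], [3, 4]], columns=[10, 20])
# Cast the integer column labels to strings instead of the row index.
out = df.change_index_dtype(dtype=str, axis="columns")
# out.columns now holds the string labels '10' and '20'.
```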

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A pandas DataFrame. | required |
| dtype | Union[str, dict] | Use a str or dtype to cast the entire Index to the same type. Alternatively, use a dictionary to change the MultiIndex to new dtypes. | required |
| axis | str | Determines which axis to change the dtype(s). Should be either 'index' or 'columns'. | 'index' |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A pandas DataFrame with new Index. |

Source code in janitor/functions/change_index_dtype.py
@pf.register_dataframe_method
def change_index_dtype(
    df: pd.DataFrame, dtype: Union[str, dict], axis: str = "index"
) -> pd.DataFrame:
    """Cast an index to a specified dtype ``dtype``.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import numpy as np
        >>> import janitor
        >>> rng = np.random.default_rng(seed=0)
        >>> np.random.seed(0)
        >>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
        ...             'foo', 'foo', 'qux', 'qux'],
        ...              [1.0, 2.0, 1.0, 2.0,
        ...               1.0, 2.0, 1.0, 2.0]]))
        >>> idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
        >>> df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])
        >>> df
                             A         B
        first second
        bar   1.0     1.764052  0.400157
              2.0     0.978738  2.240893
        baz   1.0     1.867558 -0.977278
              2.0     0.950088 -0.151357
        foo   1.0    -0.103219  0.410599
              2.0     0.144044  1.454274
        qux   1.0     0.761038  0.121675
              2.0     0.443863  0.333674
        >>> outcome=df.change_index_dtype(dtype=str)
        >>> outcome
                             A         B
        first second
        bar   1.0     1.764052  0.400157
              2.0     0.978738  2.240893
        baz   1.0     1.867558 -0.977278
              2.0     0.950088 -0.151357
        foo   1.0    -0.103219  0.410599
              2.0     0.144044  1.454274
        qux   1.0     0.761038  0.121675
              2.0     0.443863  0.333674
        >>> outcome.index.dtypes
        first     object
        second    object
        dtype: object
        >>> outcome=df.change_index_dtype(dtype={'second':int})
        >>> outcome
                             A         B
        first second
        bar   1       1.764052  0.400157
              2       0.978738  2.240893
        baz   1       1.867558 -0.977278
              2       0.950088 -0.151357
        foo   1      -0.103219  0.410599
              2       0.144044  1.454274
        qux   1       0.761038  0.121675
              2       0.443863  0.333674
        >>> outcome.index.dtypes
        first     object
        second     int64
        dtype: object
        >>> outcome=df.change_index_dtype(dtype={0:'category',1:int})
        >>> outcome
                             A         B
        first second
        bar   1       1.764052  0.400157
              2       0.978738  2.240893
        baz   1       1.867558 -0.977278
              2       0.950088 -0.151357
        foo   1      -0.103219  0.410599
              2       0.144044  1.454274
        qux   1       0.761038  0.121675
              2       0.443863  0.333674
        >>> outcome.index.dtypes
        first     category
        second       int64
        dtype: object

    Args:
        df: A pandas DataFrame.
        dtype : Use a str or dtype to cast the entire Index
            to the same type.
            Alternatively, use a dictionary to change the MultiIndex
            to new dtypes.
        axis: Determines which axis to change the dtype(s).
            Should be either 'index' or 'columns'.

    Returns:
        A pandas DataFrame with new Index.
    """  # noqa: E501

    check("axis", axis, [str])
    if axis not in {"index", "columns"}:
        raise ValueError("axis should be either index or columns.")

    df = df[:]
    current_index = getattr(df, axis)
    if not isinstance(current_index, pd.MultiIndex):
        if isinstance(dtype, dict):
            raise TypeError(
                "Changing the dtype via a dictionary "
                "is not supported for a single index."
            )
        current_index = current_index.astype(dtype)
        setattr(df, axis, current_index)
        return df

    if not isinstance(dtype, dict):
        dtype = {
            level_number: dtype
            for level_number in range(current_index.nlevels)
        }

    all_str = all(isinstance(level, str) for level in dtype)
    all_int = all(isinstance(level, int) for level in dtype)
    if not all_str | all_int:
        raise TypeError(
            "The levels in the dictionary "
            "should be either all strings or all integers."
        )

    dtype = {
        current_index._get_level_number(label): _dtype
        for label, _dtype in dtype.items()
    }

    new_levels = []
    codes = current_index.codes
    levels = current_index.levels

    for level_number in range(current_index.nlevels):
        _index = levels[level_number]
        if level_number in dtype:
            _dtype = dtype[level_number]
            _index = _index.astype(_dtype)
        new_levels.append(_index)

    current_index = pd.MultiIndex(
        levels=new_levels,
        codes=codes,
        names=current_index.names,
        copy=False,
        verify_integrity=False,
    )
    setattr(df, axis, current_index)
    return df

change_type

change_type(df, column_name, dtype, ignore_exception=False)

Change the type of a column.

This method does not mutate the original DataFrame.

Exceptions that are raised can be ignored. For example, if one has a mixed dtype column that has non-integer strings and integers, and you want to coerce everything to integers, you can optionally ignore the non-integer strings and replace them with NaN or keep the original value.

Intended to be the method-chaining alternative to:

df[col] = df[col].astype(dtype)

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.astype instead.

Examples:

Change the type of a column.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df
   col1  col2
0     0     m
1     1     5
2     2  True
>>> df.change_type(
...     "col1", dtype=str,
... ).change_type(
...     "col2", dtype=float, ignore_exception="fillna",
... )
  col1  col2
0    0   NaN
1    1   5.0
2    2   1.0

Change the type of multiple columns. To change the type of all columns, please use DataFrame.astype instead.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df.change_type(['col1', 'col2'], str)
  col1  col2
0    0     m
1    1     5
2    2  True

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable | list[Hashable] | Index

The column(s) in the dataframe.

required
dtype type

The datatype to convert to. Should be one of the standard Python types, or a numpy datatype.

required
ignore_exception bool

One of {False, "fillna", "keep_values"}.

False

Raises:

Type Description
ValueError

If unknown option provided for ignore_exception.

Returns:

Type Description
DataFrame

A pandas DataFrame with changed column types.
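
Besides `"fillna"`, `ignore_exception` also accepts `"keep_values"`, which falls back to `astype(..., errors="ignore")` and therefore leaves the column unchanged if the cast fails. A minimal sketch (not an official doctest):

```python
import pandas as pd
import janitor

df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
# "keep_values": if casting to float fails (here because of "m"),
# the original column values are kept as-is instead of raising
out = df.change_type("col2", dtype=float, ignore_exception="keep_values")
```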

Source code in janitor/functions/change_type.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.astype` instead."
    )
)
@deprecated_alias(column="column_name")
def change_type(
    df: pd.DataFrame,
    column_name: Hashable | list[Hashable] | pd.Index,
    dtype: type,
    ignore_exception: bool = False,
) -> pd.DataFrame:
    """Change the type of a column.

    This method does not mutate the original DataFrame.

    Exceptions that are raised can be ignored. For example, if one has a mixed
    dtype column that has non-integer strings and integers, and you want to
    coerce everything to integers, you can optionally ignore the non-integer
    strings and replace them with `NaN` or keep the original value.

    Intended to be the method-chaining alternative to:

    ```python
    df[col] = df[col].astype(dtype)
    ```

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.astype` instead.

    Examples:
        Change the type of a column.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
        >>> df
           col1  col2
        0     0     m
        1     1     5
        2     2  True
        >>> df.change_type(
        ...     "col1", dtype=str,
        ... ).change_type(
        ...     "col2", dtype=float, ignore_exception="fillna",
        ... )
          col1  col2
        0    0   NaN
        1    1   5.0
        2    2   1.0

        Change the type of multiple columns. To change the type of all columns,
        please use `DataFrame.astype` instead.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
        >>> df.change_type(['col1', 'col2'], str)
          col1  col2
        0    0     m
        1    1     5
        2    2  True

    Args:
        df: A pandas DataFrame.
        column_name: The column(s) in the dataframe.
        dtype: The datatype to convert to. Should be one of the standard
            Python types, or a numpy datatype.
        ignore_exception: One of `{False, "fillna", "keep_values"}`.

    Raises:
        ValueError: If unknown option provided for `ignore_exception`.

    Returns:
        A pandas DataFrame with changed column types.
    """  # noqa: E501

    df = df.copy()  # avoid mutating the original DataFrame
    if not ignore_exception:
        df[column_name] = df[column_name].astype(dtype)
    elif ignore_exception == "keep_values":
        df[column_name] = df[column_name].astype(dtype, errors="ignore")
    elif ignore_exception == "fillna":
        if isinstance(column_name, Hashable):
            column_name = [column_name]
        df[column_name] = df[column_name].map(_convert, dtype=dtype)
    else:
        raise ValueError("Unknown option for ignore_exception")

    return df

clean_names

Functions for cleaning columns names.

clean_names(df, axis='columns', column_names=None, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=True, preserve_original_labels=True, enforce_string=True, truncate_limit=None)

Clean column/index names. It can also be applied to column values.

Takes all column names, converts them to lowercase, then replaces all spaces with underscores.

By default, column names are converted to string types. This can be switched off by passing in enforce_string=False.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "Aloha": range(3),
...         "Bell Chart": range(3),
...         "Animals@#$%^": range(3)
...     }
... )
>>> df
   Aloha  Bell Chart  Animals@#$%^
0      0           0             0
1      1           1             1
2      2           2             2
>>> df.clean_names()
   aloha  bell_chart  animals@#$%^
0      0           0             0
1      1           1             1
2      2           2             2
>>> df.clean_names(remove_special=True)
   aloha  bell_chart  animals
0      0           0        0
1      1           1        1
2      2           2        2

Version Changed

  • 0.26.0
    • Added axis and column_names parameters.

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
axis Union[str, None]

Whether to clean the labels on the index or columns. If None, applies to a defined column or columns in column_names.

'columns'
column_names Union[str, list]

Clean the values in a column. axis should be None. Column selection is possible using the select syntax.

None
strip_underscores Optional[Union[str, bool]]

Removes the outer underscores from all column names. Default None keeps outer underscores. Values can be either 'left', 'right' or 'both' or the respective shorthand 'l', 'r' and True.

None
case_type str

Whether to make columns lower or uppercase. Current case may be preserved with 'preserve', while snake case conversion (from CamelCase or camelCase only) can be turned on using "snake". Default 'lower' makes all characters lowercase.

'lower'
remove_special bool

Remove special characters from columns. Only letters, numbers and underscores are preserved.

False
strip_accents bool

Whether or not to remove accents from columns names.

True
preserve_original_labels bool

Preserve original names. This is later retrievable using df.original_labels. Applies if axis is not None.

True
enforce_string bool

Whether or not to convert all column names to string type. Defaults to True, but can be turned off. Columns with >1 levels will not be converted by default.

True
truncate_limit int

Truncates formatted column names to the specified length. Default None does not truncate.

None

Raises:

Type Description
ValueError

If axis=None and column_names=None.

Returns:

Type Description
DataFrame

A pandas DataFrame.
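
When `axis=None` and `column_names` is given, the same cleaning rules are applied to the values of the selected column(s) instead of the labels. A minimal sketch, with an invented column and values:

```python
import pandas as pd
import janitor

df = pd.DataFrame({"Raw Label": ["First Value", "Second Value"]})
# clean the column labels first: "Raw Label" -> "raw_label"
df = df.clean_names()
# then clean the *values* of that column; axis must be None here
out = df.clean_names(axis=None, column_names="raw_label")
# under the default lower-casing and space-to-underscore rules,
# the values should become "first_value" and "second_value"
```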

Source code in janitor/functions/clean_names.py
@pf.register_dataframe_method
@deprecated_alias(preserve_original_columns="preserve_original_labels")
def clean_names(
    df: pd.DataFrame,
    axis: Union[str, None] = "columns",
    column_names: Union[str, list] = None,
    strip_underscores: Optional[Union[str, bool]] = None,
    case_type: str = "lower",
    remove_special: bool = False,
    strip_accents: bool = True,
    preserve_original_labels: bool = True,
    enforce_string: bool = True,
    truncate_limit: int = None,
) -> pd.DataFrame:
    """Clean column/index names. It can also be applied to column values.

    Takes all column names, converts them to lowercase,
    then replaces all spaces with underscores.

    By default, column names are converted to string types.
    This can be switched off by passing in `enforce_string=False`.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "Aloha": range(3),
        ...         "Bell Chart": range(3),
        ...         "Animals@#$%^": range(3)
        ...     }
        ... )
        >>> df
           Aloha  Bell Chart  Animals@#$%^
        0      0           0             0
        1      1           1             1
        2      2           2             2
        >>> df.clean_names()
           aloha  bell_chart  animals@#$%^
        0      0           0             0
        1      1           1             1
        2      2           2             2
        >>> df.clean_names(remove_special=True)
           aloha  bell_chart  animals
        0      0           0        0
        1      1           1        1
        2      2           2        2

    !!! summary "Version Changed"

        - 0.26.0
             - Added `axis` and `column_names` parameters.

    Args:
        df: The pandas DataFrame object.
        axis: Whether to clean the labels on the index or columns.
            If `None`, applies to a defined column
            or columns in `column_names`.
        column_names: Clean the values in a column.
            `axis` should be `None`.
            Column selection is possible using the
            [`select`][janitor.functions.select.select] syntax.
        strip_underscores: Removes the outer underscores from all
            column names. Default None keeps outer underscores. Values can be
            either 'left', 'right' or 'both' or the respective shorthand 'l',
            'r' and True.
        case_type: Whether to make columns lower or uppercase.
            Current case may be preserved with 'preserve',
            while snake case conversion (from CamelCase or camelCase only)
            can be turned on using "snake".
            Default 'lower' makes all characters lowercase.
        remove_special: Remove special characters from columns.
            Only letters, numbers and underscores are preserved.
        strip_accents: Whether or not to remove accents from
            columns names.
        preserve_original_labels: Preserve original names.
            This is later retrievable using `df.original_labels`.
            Applies if `axis` is not None.
        enforce_string: Whether or not to convert all column names
            to string type. Defaults to True, but can be turned off.
            Columns with >1 levels will not be converted by default.
        truncate_limit: Truncates formatted column names to
            the specified length. Default None does not truncate.

    Raises:
        ValueError: If `axis=None` and `column_names=None`.

    Returns:
        A pandas DataFrame.
    """
    if not axis and not column_names:
        raise ValueError(
            "Kindly provide an argument to `column_names`, if axis is None."
        )
    if axis is None:
        column_names = get_index_labels(
            arg=column_names, df=df, axis="columns"
        )
        if is_scalar(column_names):
            column_names = [column_names]
        df = df.copy()
        for column_name in column_names:
            df[column_name] = _clean_names_single_object(
                obj=df[column_name],
                enforce_string=enforce_string,
                case_type=case_type,
                remove_special=remove_special,
                strip_accents=strip_accents,
                strip_underscores=strip_underscores,
                truncate_limit=truncate_limit,
            )
        return df

    assert axis in {"index", "columns"}
    df = df[:]
    target_axis = getattr(df, axis)
    if isinstance(target_axis, pd.MultiIndex):
        target_axis = [
            target_axis.get_level_values(number)
            for number in range(target_axis.nlevels)
        ]
        target_axis = [
            _clean_names_single_object(
                obj=obj,
                enforce_string=enforce_string,
                case_type=case_type,
                remove_special=remove_special,
                strip_accents=strip_accents,
                strip_underscores=strip_underscores,
                truncate_limit=truncate_limit,
            )
            for obj in target_axis
        ]
    else:
        target_axis = _clean_names_single_object(
            obj=target_axis,
            enforce_string=enforce_string,
            case_type=case_type,
            remove_special=remove_special,
            strip_accents=strip_accents,
            strip_underscores=strip_underscores,
            truncate_limit=truncate_limit,
        )
    # Store the original column names, if enabled by user
    if preserve_original_labels:
        df.__dict__["original_labels"] = getattr(df, axis)
    setattr(df, axis, target_axis)
    return df

coalesce

Function for performing coalesce.

coalesce(df, *column_names, target_column_name=None, default_value=None)

Coalesce two or more columns of data in order of column names provided.

Given the variable arguments of column names, coalesce finds and returns the first non-missing value from these columns, for every row in the input dataframe. If all the column values are null for a particular row, then the default_value will be filled in.

If target_column_name is not provided, then the first column is coalesced.

This method does not mutate the original DataFrame.

The select syntax can be used in column_names.

Examples:

Use coalesce with 3 columns, "a", "b" and "c".

>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [np.nan, 1, np.nan],
...     "b": [2, 3, np.nan],
...     "c": [4, np.nan, np.nan],
... })
>>> df.coalesce("a", "b", "c")
     a    b    c
0  2.0  2.0  4.0
1  1.0  3.0  NaN
2  NaN  NaN  NaN

Provide a target_column_name.

>>> df.coalesce("a", "b", "c", target_column_name="new_col")
     a    b    c  new_col
0  NaN  2.0  4.0      2.0
1  1.0  3.0  NaN      1.0
2  NaN  NaN  NaN      NaN

Provide a default value.

>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [1, np.nan, np.nan],
...     "b": [2, 3, np.nan],
... })
>>> df.coalesce(
...     "a", "b",
...     target_column_name="new_col",
...     default_value=-1,
... )
     a    b  new_col
0  1.0  2.0      1.0
1  NaN  3.0      3.0
2  NaN  NaN     -1.0

This is more syntactic diabetes! For R users, this should look familiar to dplyr's coalesce function; for Python users, the interface should be more intuitive than the pandas.Series.combine_first method.
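
For readers who prefer to see the plain pandas equivalent, the three-column example above can be approximated with a row-wise backfill; this is only an illustrative sketch, not the library's implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 1, np.nan],
    "b": [2, 3, np.nan],
    "c": [4, np.nan, np.nan],
})
# take the first non-missing value across "a", "b", "c" for each row
first_non_missing = df[["a", "b", "c"]].bfill(axis=1)["a"]
out = df.assign(a=first_non_missing)  # comparable to df.coalesce("a", "b", "c")
```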

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Any

A list of column names.

()
target_column_name Optional[str]

The new column name after combining. If None, then the first column in column_names is updated, with the Null values replaced.

None
default_value Optional[Union[int, float, str]]

A scalar to replace any remaining nulls after coalescing.

None

Raises:

Type Description
ValueError

If length of column_names is less than 2.

Returns:

Type Description
DataFrame

A pandas DataFrame with coalesced columns.

Source code in janitor/functions/coalesce.py
@pf.register_dataframe_method
@deprecated_alias(columns="column_names", new_column_name="target_column_name")
def coalesce(
    df: pd.DataFrame,
    *column_names: Any,
    target_column_name: Optional[str] = None,
    default_value: Optional[Union[int, float, str]] = None,
) -> pd.DataFrame:
    """Coalesce two or more columns of data in order of column names provided.

    Given the variable arguments of column names,
    `coalesce` finds and returns the first non-missing value
    from these columns, for every row in the input dataframe.
    If all the column values are null for a particular row,
    then the `default_value` will be filled in.

    If `target_column_name` is not provided,
    then the first column is coalesced.

    This method does not mutate the original DataFrame.

    The [`select`][janitor.functions.select.select] syntax
    can be used in `column_names`.

    Examples:
        Use `coalesce` with 3 columns, "a", "b" and "c".

        >>> import pandas as pd
        >>> import numpy as np
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": [np.nan, 1, np.nan],
        ...     "b": [2, 3, np.nan],
        ...     "c": [4, np.nan, np.nan],
        ... })
        >>> df.coalesce("a", "b", "c")
             a    b    c
        0  2.0  2.0  4.0
        1  1.0  3.0  NaN
        2  NaN  NaN  NaN

        Provide a target_column_name.

        >>> df.coalesce("a", "b", "c", target_column_name="new_col")
             a    b    c  new_col
        0  NaN  2.0  4.0      2.0
        1  1.0  3.0  NaN      1.0
        2  NaN  NaN  NaN      NaN

        Provide a default value.

        >>> import pandas as pd
        >>> import numpy as np
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": [1, np.nan, np.nan],
        ...     "b": [2, 3, np.nan],
        ... })
        >>> df.coalesce(
        ...     "a", "b",
        ...     target_column_name="new_col",
        ...     default_value=-1,
        ... )
             a    b  new_col
        0  1.0  2.0      1.0
        1  NaN  3.0      3.0
        2  NaN  NaN     -1.0

    This is more syntactic diabetes! For R users, this should look familiar to
    `dplyr`'s `coalesce` function; for Python users, the interface
    should be more intuitive than the `pandas.Series.combine_first`
    method.

    Args:
        df: A pandas DataFrame.
        column_names: A list of column names.
        target_column_name: The new column name after combining.
            If `None`, then the first column in `column_names` is updated,
            with the Null values replaced.
        default_value: A scalar to replace any remaining nulls
            after coalescing.

    Raises:
        ValueError: If length of `column_names` is less than 2.

    Returns:
        A pandas DataFrame with coalesced columns.
    """

    if not column_names:
        return df

    indexers = _select_index([*column_names], df, axis="columns")

    if len(indexers) < 2:
        raise ValueError(
            "The number of columns to coalesce should be a minimum of 2."
        )

    if target_column_name:
        check("target_column_name", target_column_name, [str])

    if default_value:
        check("default_value", default_value, [int, float, str])

    df = df.copy()

    outcome = df.iloc[:, indexers[0]]

    for num in range(1, len(indexers)):
        position = indexers[num]
        replacement = df.iloc[:, position]
        outcome = outcome.fillna(replacement)

    if outcome.hasnans and (default_value is not None):
        outcome = outcome.fillna(default_value)

    if target_column_name is None:
        df.iloc[:, indexers[0]] = outcome
    else:
        df[target_column_name] = outcome

    return df

collapse_levels

Implementation of the collapse_levels function.

collapse_levels(df, sep=None, glue=None, axis='columns')

Flatten multi-level index/column dataframe to a single level.

This method does not mutate the original DataFrame.

Given a DataFrame containing multi-level index/columns, flatten to single-level by string-joining the labels in each level.

After a groupby / aggregate operation where .agg() is passed a list of multiple aggregation functions, a multi-level DataFrame is returned with the name of the function applied in the second level.

It is sometimes convenient for later indexing to flatten out this multi-level configuration back into a single level. This function does this through a simple string-joining of all the names across different levels in a single column.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "class": ["bird", "bird", "bird", "mammal", "mammal"],
...     "max_speed": [389, 389, 24, 80, 21],
...     "type": ["falcon", "falcon", "parrot", "Lion", "Monkey"],
... })
>>> df
    class  max_speed    type
0    bird        389  falcon
1    bird        389  falcon
2    bird         24  parrot
3  mammal         80    Lion
4  mammal         21  Monkey
>>> grouped_df = df.groupby("class")[['max_speed']].agg(["mean", "median"])
>>> grouped_df
         max_speed
              mean median
class
bird    267.333333  389.0
mammal   50.500000   50.5
>>> grouped_df.collapse_levels(sep="_")
        max_speed_mean  max_speed_median
class
bird        267.333333             389.0
mammal       50.500000              50.5

Before applying .collapse_levels, the .agg operation returns a multi-level column DataFrame whose columns are (level 1, level 2):

[("max_speed", "mean"), ("max_speed", "median")]

.collapse_levels then flattens the column MultiIndex into a single level index with names:

["max_speed_mean", "max_speed_median"]

For more control, a glue specification can be passed, where the names of the levels are used to control the output of the flattened index:

>>> (grouped_df
...  .rename_axis(columns=['column_name', 'agg_name'])
...  .collapse_levels(glue="{agg_name}_{column_name}")
... )
        mean_max_speed  median_max_speed
class
bird        267.333333             389.0
mammal       50.500000              50.5

Note that for glue to work, the keyword arguments in the glue specification should be the names of the levels in the MultiIndex.

Version Changed

  • 0.27.0
    • Added glue and axis parameters.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
sep Union[str, None]

String separator used to join the column level names.

None
glue Union[str, None]

A specification on how the column levels should be combined. It allows for a more granular composition, and serves as an alternative to sep.

None
axis

Determines whether to collapse the levels on the index or columns.

'columns'

Returns:

Type Description
DataFrame

A pandas DataFrame with single-level column index.
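
The `axis` parameter added in 0.27.0 also flattens a MultiIndex on the rows. A minimal sketch, with data invented to mirror the example above:

```python
import pandas as pd
import janitor

idx = pd.MultiIndex.from_tuples(
    [("bird", "falcon"), ("mammal", "lion")], names=["class", "type"]
)
df = pd.DataFrame({"max_speed": [389, 80]}, index=idx)
# flatten the row MultiIndex instead of the columns
out = df.collapse_levels(sep="_", axis="index")
# the row labels should become "bird_falcon" and "mammal_lion"
```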

Source code in janitor/functions/collapse_levels.py
@pf.register_dataframe_method
def collapse_levels(
    df: pd.DataFrame,
    sep: Union[str, None] = None,
    glue: Union[str, None] = None,
    axis="columns",
) -> pd.DataFrame:
    """Flatten multi-level index/column dataframe to a single level.

    This method does not mutate the original DataFrame.

    Given a DataFrame containing multi-level index/columns, flatten to single-level
    by string-joining the labels in each level.

    After a `groupby` / `aggregate` operation where `.agg()` is passed a
    list of multiple aggregation functions, a multi-level DataFrame is
    returned with the name of the function applied in the second level.

    It is sometimes convenient for later indexing to flatten out this
    multi-level configuration back into a single level. This function does
    this through a simple string-joining of all the names across different
    levels in a single column.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "class": ["bird", "bird", "bird", "mammal", "mammal"],
        ...     "max_speed": [389, 389, 24, 80, 21],
        ...     "type": ["falcon", "falcon", "parrot", "Lion", "Monkey"],
        ... })
        >>> df
            class  max_speed    type
        0    bird        389  falcon
        1    bird        389  falcon
        2    bird         24  parrot
        3  mammal         80    Lion
        4  mammal         21  Monkey
        >>> grouped_df = df.groupby("class")[['max_speed']].agg(["mean", "median"])
        >>> grouped_df  # doctest: +NORMALIZE_WHITESPACE
                 max_speed
                      mean median
        class
        bird    267.333333  389.0
        mammal   50.500000   50.5
        >>> grouped_df.collapse_levels(sep="_")  # doctest: +NORMALIZE_WHITESPACE
                max_speed_mean  max_speed_median
        class
        bird        267.333333             389.0
        mammal       50.500000              50.5

        Before applying `.collapse_levels`, the `.agg` operation returns a
        multi-level column DataFrame whose columns are `(level 1, level 2)`:

        ```python
        [("max_speed", "mean"), ("max_speed", "median")]
        ```

        `.collapse_levels` then flattens the column MultiIndex into a single
        level index with names:

        ```python
        ["max_speed_mean", "max_speed_median"]
        ```

        For more control, a `glue` specification can be passed,
        where the names of the levels are used to control the output of the
        flattened index:
        >>> (grouped_df
        ...  .rename_axis(columns=['column_name', 'agg_name'])
        ...  .collapse_levels(glue="{agg_name}_{column_name}")
        ... )
                mean_max_speed  median_max_speed
        class
        bird        267.333333             389.0
        mammal       50.500000              50.5

        Note that for `glue` to work, the keyword arguments
        in the glue specification
        should be the names of the levels in the MultiIndex.

    !!! abstract "Version Changed"

        - 0.27.0
            - Added `glue` and `axis` parameters.

    Args:
        df: A pandas DataFrame.
        sep: String separator used to join the column level names.
        glue: A specification on how the column levels should be combined.
            It allows for a more granular composition,
            and serves as an alternative to `sep`.
        axis: Determines whether to collapse the
            levels on the index or columns.

    Returns:
        A pandas DataFrame with single-level column index.
    """  # noqa: E501
    if (sep is not None) and (glue is not None):
        raise ValueError("Only one of sep or glue should be provided.")
    if sep is not None:
        check("sep", sep, [str])
    if glue is not None:
        check("glue", glue, [str])
    check("axis", axis, [str])
    if axis not in {"index", "columns"}:
        raise ValueError(
            "axis argument should be either 'index' or 'columns'."
        )

    if not isinstance(getattr(df, axis), pd.MultiIndex):
        return df

    # TODO: Pyarrow offers faster string computations
    # future work should take this into consideration,
    # which would require a different route from python's string.join
    # since work is only on the columns
    # it is safe, and more efficient to slice/view the dataframe
    # plus Pandas creates a new Index altogether
    # as such, the original dataframe is not modified
    df = df[:]
    new_index = getattr(df, axis)
    if glue is not None:
        new_index = [dict(zip(new_index.names, entry)) for entry in new_index]
        new_index = [glue.format_map(mapping) for mapping in new_index]
        setattr(df, axis, new_index)
        return df
    sep = "_" if sep is None else sep
    levels = [level for level in new_index.levels]
    all_strings = all(map(is_string_dtype, levels))
    if all_strings:
        no_empty_string = all((entry != "").all() for entry in levels)
        if no_empty_string:
            new_index = new_index.map(sep.join)
            setattr(df, axis, new_index)
            return df
    new_index = (map(str, entry) for entry in new_index)
    new_index = [
        # faster to use a list comprehension within string.join
        # compared to a generator
        # https://stackoverflow.com/a/37782238
        sep.join([entry for entry in word if entry])
        for word in new_index
    ]
    setattr(df, axis, new_index)
    return df

complete

complete(df, *columns, sort=False, by=None, fill_value=None, explicit=True)

Complete a data frame with missing combinations of data.

It is modeled after tidyr's complete function, and is a wrapper around expand_grid, pd.merge and pd.fillna. In a way, it is the inverse of pd.dropna, as it exposes implicitly missing rows.

Combinations of column names or a list/tuple of column names, or even a dictionary of column names and new values are possible. If a dictionary is passed, the user is required to ensure that the values are unique 1-D arrays. The keys in a dictionary must be present in the dataframe.

Examples:

>>> import pandas as pd
>>> import janitor
>>> import numpy as np
>>> df = pd.DataFrame(
...     {
...         "Year": [1999, 2000, 2004, 1999, 2004],
...         "Taxon": [
...             "Saccharina",
...             "Saccharina",
...             "Saccharina",
...             "Agarum",
...             "Agarum",
...         ],
...         "Abundance": [4, 5, 2, 1, 8],
...     }
... )
>>> df
   Year       Taxon  Abundance
0  1999  Saccharina          4
1  2000  Saccharina          5
2  2004  Saccharina          2
3  1999      Agarum          1
4  2004      Agarum          8

Expose missing pairings of Year and Taxon:

>>> df.complete("Year", "Taxon", sort=True)
   Year       Taxon  Abundance
0  1999      Agarum        1.0
1  1999  Saccharina        4.0
2  2000      Agarum        NaN
3  2000  Saccharina        5.0
4  2004      Agarum        8.0
5  2004  Saccharina        2.0

Expose missing years from 1999 to 2004:

>>> df.complete(
...     {"Year": range(df.Year.min(), df.Year.max() + 1)},
...     "Taxon",
...     sort=True
... )
    Year       Taxon  Abundance
0   1999      Agarum        1.0
1   1999  Saccharina        4.0
2   2000      Agarum        NaN
3   2000  Saccharina        5.0
4   2001      Agarum        NaN
5   2001  Saccharina        NaN
6   2002      Agarum        NaN
7   2002  Saccharina        NaN
8   2003      Agarum        NaN
9   2003  Saccharina        NaN
10  2004      Agarum        8.0
11  2004  Saccharina        2.0

Fill missing values:

>>> df = pd.DataFrame(
...     dict(
...         group=(1, 2, 1, 2),
...         item_id=(1, 2, 2, 3),
...         item_name=("a", "a", "b", "b"),
...         value1=(1, np.nan, 3, 4),
...         value2=range(4, 8),
...     )
... )
>>> df
   group  item_id item_name  value1  value2
0      1        1         a     1.0       4
1      2        2         a     NaN       5
2      1        2         b     3.0       6
3      2        3         b     4.0       7
>>> df.complete(
...     "group",
...     ("item_id", "item_name"),
...     fill_value={"value1": 0, "value2": 99},
...     sort=True
... )
   group  item_id item_name  value1  value2
0      1        1         a     1.0     4.0
1      1        2         a     0.0    99.0
2      1        2         b     3.0     6.0
3      1        3         b     0.0    99.0
4      2        1         a     0.0    99.0
5      2        2         a     0.0     5.0
6      2        2         b     0.0    99.0
7      2        3         b     4.0     7.0

Limit the fill to only implicit missing values by setting explicit to False:

>>> df.complete(
...     "group",
...     ("item_id", "item_name"),
...     fill_value={"value1": 0, "value2": 99},
...     explicit=False,
...     sort=True
... )
   group  item_id item_name  value1  value2
0      1        1         a     1.0     4.0
1      1        2         a     0.0    99.0
2      1        2         b     3.0     6.0
3      1        3         b     0.0    99.0
4      2        1         a     0.0    99.0
5      2        2         a     NaN     5.0
6      2        2         b     0.0    99.0
7      2        3         b     4.0     7.0

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*columns Any

This refers to the columns to be completed. It could be column labels (string type), a list/tuple of column labels, or a dictionary that pairs column labels with new values.

()
sort bool

Sort DataFrame based on *columns.

False
by Optional[Union[list, str]]

Label or list of labels to group by. The explicit missing rows are returned per group.

None
fill_value Optional[Union[Dict, Any]]

Scalar value to use instead of NaN for missing combinations. A dictionary, mapping columns names to a scalar value is also accepted.

None
explicit bool

Determines if only implicitly missing values should be filled (False), or all nulls existing in the dataframe (True). explicit is applicable only if fill_value is not None.

True

Returns:

Type Description
DataFrame

A pandas DataFrame with explicit missing rows, if any.
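
The `by` parameter is not covered by the doctests above; it exposes the missing combinations within each group separately. A minimal sketch, with invented state/year data (the expected behaviour is described in comments rather than reproduced from the library's tests):

```python
import pandas as pd
import janitor

df = pd.DataFrame({
    "state": ["CA", "CA", "HI", "HI"],
    "year": [2010, 2013, 2010, 2012],
    "value": [1, 3, 1, 2],
})
# expose the missing years 2010-2013 within each state separately
out = df.complete({"year": range(2010, 2014)}, by="state", sort=True)
# each state should now carry one row per year in the range,
# with NaN in "value" for the newly exposed rows
```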

Source code in janitor/functions/complete.py
@pf.register_dataframe_method
def complete(
    df: pd.DataFrame,
    *columns: Any,
    sort: bool = False,
    by: Optional[Union[list, str]] = None,
    fill_value: Optional[Union[Dict, Any]] = None,
    explicit: bool = True,
) -> pd.DataFrame:
    """Complete a data frame with missing combinations of data.

    It is modeled after tidyr's `complete` function, and is a wrapper around
    [`expand_grid`][janitor.functions.expand_grid.expand_grid], `pd.merge`
    and `pd.fillna`. In a way, it is the inverse of `pd.dropna`, as it exposes
    implicitly missing rows.

    Combinations of column names or a list/tuple of column names, or even a
    dictionary of column names and new values are possible.
    If a dictionary is passed,
    the user is required to ensure that the values are unique 1-D arrays.
    The keys in a dictionary must be present in the dataframe.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> import numpy as np
        >>> df = pd.DataFrame(
        ...     {
        ...         "Year": [1999, 2000, 2004, 1999, 2004],
        ...         "Taxon": [
        ...             "Saccharina",
        ...             "Saccharina",
        ...             "Saccharina",
        ...             "Agarum",
        ...             "Agarum",
        ...         ],
        ...         "Abundance": [4, 5, 2, 1, 8],
        ...     }
        ... )
        >>> df
           Year       Taxon  Abundance
        0  1999  Saccharina          4
        1  2000  Saccharina          5
        2  2004  Saccharina          2
        3  1999      Agarum          1
        4  2004      Agarum          8

        Expose missing pairings of `Year` and `Taxon`:
        >>> df.complete("Year", "Taxon", sort=True)
           Year       Taxon  Abundance
        0  1999      Agarum        1.0
        1  1999  Saccharina        4.0
        2  2000      Agarum        NaN
        3  2000  Saccharina        5.0
        4  2004      Agarum        8.0
        5  2004  Saccharina        2.0

        Expose missing years from 1999 to 2004:
        >>> df.complete(
        ...     {"Year": range(df.Year.min(), df.Year.max() + 1)},
        ...     "Taxon",
        ...     sort=True
        ... )
            Year       Taxon  Abundance
        0   1999      Agarum        1.0
        1   1999  Saccharina        4.0
        2   2000      Agarum        NaN
        3   2000  Saccharina        5.0
        4   2001      Agarum        NaN
        5   2001  Saccharina        NaN
        6   2002      Agarum        NaN
        7   2002  Saccharina        NaN
        8   2003      Agarum        NaN
        9   2003  Saccharina        NaN
        10  2004      Agarum        8.0
        11  2004  Saccharina        2.0

        Fill missing values:
        >>> df = pd.DataFrame(
        ...     dict(
        ...         group=(1, 2, 1, 2),
        ...         item_id=(1, 2, 2, 3),
        ...         item_name=("a", "a", "b", "b"),
        ...         value1=(1, np.nan, 3, 4),
        ...         value2=range(4, 8),
        ...     )
        ... )
        >>> df
           group  item_id item_name  value1  value2
        0      1        1         a     1.0       4
        1      2        2         a     NaN       5
        2      1        2         b     3.0       6
        3      2        3         b     4.0       7
        >>> df.complete(
        ...     "group",
        ...     ("item_id", "item_name"),
        ...     fill_value={"value1": 0, "value2": 99},
        ...     sort=True
        ... )
           group  item_id item_name  value1  value2
        0      1        1         a     1.0     4.0
        1      1        2         a     0.0    99.0
        2      1        2         b     3.0     6.0
        3      1        3         b     0.0    99.0
        4      2        1         a     0.0    99.0
        5      2        2         a     0.0     5.0
        6      2        2         b     0.0    99.0
        7      2        3         b     4.0     7.0

        Limit the fill to only implicit missing values
        by setting explicit to `False`:
        >>> df.complete(
        ...     "group",
        ...     ("item_id", "item_name"),
        ...     fill_value={"value1": 0, "value2": 99},
        ...     explicit=False,
        ...     sort=True
        ... )
           group  item_id item_name  value1  value2
        0      1        1         a     1.0     4.0
        1      1        2         a     0.0    99.0
        2      1        2         b     3.0     6.0
        3      1        3         b     0.0    99.0
        4      2        1         a     0.0    99.0
        5      2        2         a     NaN     5.0
        6      2        2         b     0.0    99.0
        7      2        3         b     4.0     7.0

    Args:
        df: A pandas DataFrame.
        *columns: This refers to the columns to be completed.
            It could be column labels (string type),
            a list/tuple of column labels, or a dictionary that pairs
            column labels with new values.
        sort: Sort DataFrame based on *columns.
        by: Label or list of labels to group by.
            The explicit missing rows are returned per group.
        fill_value: Scalar value to use instead of NaN
            for missing combinations. A dictionary, mapping columns names
            to a scalar value is also accepted.
        explicit: Determines if only implicitly missing values
            should be filled (`False`), or all nulls existing in the dataframe
            (`True`). `explicit` is applicable only
            if `fill_value` is not `None`.

    Returns:
        A pandas DataFrame with explicit missing rows, if any.
    """  # noqa: E501

    if not columns:
        return df

    # no copy made of the original dataframe
    # since pd.merge (computed some lines below)
    # makes a new object - essentially a copy
    return _computations_complete(df, columns, sort, by, fill_value, explicit)

concatenate_columns

concatenate_columns(df, column_names, new_column_name, sep='-', ignore_empty=True)

Concatenates the set of columns into a single column.

Used to quickly generate an index based on a group of columns.

This method mutates the original DataFrame.

Examples:

Concatenate two columns row-wise.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")})
>>> df
   a  b
0  1  x
1  3  y
2  5  z
>>> df.concatenate_columns(
...     column_names=["a", "b"], new_column_name="m",
... )
   a  b    m
0  1  x  1-x
1  3  y  3-y
2  5  z  5-z

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names List[Hashable]

A list of columns to concatenate together.

required
new_column_name Hashable

The name of the new column.

required
sep str

The separator between each column's data.

'-'
ignore_empty bool

Ignore null values if they exist.

True

Raises:

Type Description
JanitorError

If at least two columns are not provided within column_names.

Returns:

Type Description
DataFrame

A pandas DataFrame with concatenated columns.

Source code in janitor/functions/concatenate_columns.py
@pf.register_dataframe_method
@deprecated_alias(columns="column_names")
def concatenate_columns(
    df: pd.DataFrame,
    column_names: List[Hashable],
    new_column_name: Hashable,
    sep: str = "-",
    ignore_empty: bool = True,
) -> pd.DataFrame:
    """Concatenates the set of columns into a single column.

    Used to quickly generate an index based on a group of columns.

    This method mutates the original DataFrame.

    Examples:
        Concatenate two columns row-wise.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")})
        >>> df
           a  b
        0  1  x
        1  3  y
        2  5  z
        >>> df.concatenate_columns(
        ...     column_names=["a", "b"], new_column_name="m",
        ... )
           a  b    m
        0  1  x  1-x
        1  3  y  3-y
        2  5  z  5-z

    Args:
        df: A pandas DataFrame.
        column_names: A list of columns to concatenate together.
        new_column_name: The name of the new column.
        sep: The separator between each column's data.
        ignore_empty: Ignore null values if they exist.

    Raises:
        JanitorError: If at least two columns are not provided
            within `column_names`.

    Returns:
        A pandas DataFrame with concatenated columns.
    """
    if len(column_names) < 2:
        raise JanitorError("At least two columns must be specified")

    df[new_column_name] = (
        df[column_names].astype(str).fillna("").agg(sep.join, axis=1)
    )

    if ignore_empty:

        def remove_empty_string(x):
            """Ignore empty/null string values from the concatenated output."""
            return sep.join(x for x in x.split(sep) if x)

        df[new_column_name] = df[new_column_name].transform(
            remove_empty_string
        )

    return df

conditional_join

conditional_join(df, right, *conditions, how='inner', sort_by_appearance=False, df_columns=slice(None), right_columns=slice(None), keep='all', use_numba=False, indicator=False, force=False)

The conditional_join function operates similarly to pd.merge, but supports joins on inequality operators, or a combination of equi and non-equi joins.

Joins solely on equality are not supported.

If the join is solely on equality, pd.merge function covers that; if you are interested in nearest joins, asof joins, or rolling joins, then pd.merge_asof covers that. There is also pandas' IntervalIndex, which is efficient for range joins, especially if the intervals do not overlap.

Column selection in df_columns and right_columns is possible using the select syntax.

Performance might be improved by setting use_numba to True. This assumes that numba is installed.

This function returns rows, if any, where values from df meet the condition(s) for values from right. The conditions are passed in as a variable argument of tuples, where the tuple is of the form (left_on, right_on, op); left_on is the column label from df, right_on is the column label from right, while op is the operator.

The col class is also supported in the conditional_join syntax.

For multiple conditions, the and(&) operator is used to combine the results of the individual conditions.

In some scenarios there might be performance gains if the less than join, or the greater than join condition, or the range condition is executed before the equi join - pass force=True to force this.

The operator can be any of ==, !=, <=, <, >=, >.

The join is done only on the columns.

For non-equi joins, only numeric, timedelta and date columns are supported.

inner, left, right and outer joins are supported.

If the columns from df and right have nothing in common, a single index column is returned; else, a MultiIndex column is returned.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})
>>> df2 = pd.DataFrame({"value_2A": [0, 3, 7, 12, 0, 2, 3, 1],
...                     "value_2B": [1, 5, 9, 15, 1, 4, 6, 3],
...                    })
>>> df1
   value_1
0        2
1        5
2        7
3        1
4        3
5        4
>>> df2
   value_2A  value_2B
0         0         1
1         3         5
2         7         9
3        12        15
4         0         1
5         2         4
6         3         6
7         1         3
>>> df1.conditional_join(
...     df2,
...     ("value_1", "value_2A", ">"),
...     ("value_1", "value_2B", "<")
... )
   value_1  value_2A  value_2B
0        2         1         3
1        5         3         6
2        3         2         4
3        4         3         5
4        4         3         6

Use the col class:

>>> df1.conditional_join(
...     df2,
...     col("value_1") > col("value_2A"),
...     col("value_1") < col("value_2B")
... )
   value_1  value_2A  value_2B
0        2         1         3
1        5         3         6
2        3         2         4
3        4         3         5
4        4         3         6

Select specific columns, after the join:

>>> df1.conditional_join(
...     df2,
...     col("value_1") > col("value_2A"),
...     col("value_1") < col("value_2B"),
...     right_columns='value_2B',
...     how='left'
... )
   value_1  value_2B
0        2       3.0
1        5       6.0
2        7       NaN
3        1       NaN
4        3       4.0
5        4       5.0
6        4       6.0

Rename columns, before the join:

>>> (df1
...  .rename(columns={'value_1':'left_column'})
...  .conditional_join(
...      df2,
...      ("left_column", "value_2A", ">"),
...      ("left_column", "value_2B", "<"),
...      right_columns='value_2B',
...      how='outer')
... )
    left_column  value_2B
0           7.0       NaN
1           1.0       NaN
2           2.0       3.0
3           5.0       6.0
4           3.0       4.0
5           4.0       5.0
6           4.0       6.0
7           NaN       1.0
8           NaN       9.0
9           NaN      15.0
10          NaN       1.0

Get the first match:

>>> df1.conditional_join(
...     df2,
...     col("value_1") > col("value_2A"),
...     col("value_1") < col("value_2B"),
...     keep='first'
... )
   value_1  value_2A  value_2B
0        2         1         3
1        5         3         6
2        3         2         4
3        4         3         5

Get the last match:

>>> df1.conditional_join(
...     df2,
...     col("value_1") > col("value_2A"),
...     col("value_1") < col("value_2B"),
...     keep='last'
... )
   value_1  value_2A  value_2B
0        2         1         3
1        5         3         6
2        3         2         4
3        4         3         6

Add an indicator column:

>>> df1.conditional_join(
...     df2,
...     ("value_1", "value_2A", ">"),
...     ("value_1", "value_2B", "<"),
...     how='outer',
...     indicator=True
... )
    value_1      _merge  value_2A  value_2B
0       7.0   left_only       NaN       NaN
1       1.0   left_only       NaN       NaN
2       2.0        both       1.0       3.0
3       5.0        both       3.0       6.0
4       3.0        both       2.0       4.0
5       4.0        both       3.0       5.0
6       4.0        both       3.0       6.0
7       NaN  right_only       0.0       1.0
8       NaN  right_only       7.0       9.0
9       NaN  right_only      12.0      15.0
10      NaN  right_only       0.0       1.0

Version Changed

  • 0.24.0
    • Added df_columns, right_columns, keep and use_numba parameters.
  • 0.24.1
    • Added indicator parameter.
  • 0.25.0
    • col class supported.
    • Outer join supported. sort_by_appearance deprecated.
    • Numba support for equi join
  • 0.27.0
    • Added support for timedelta dtype.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
right Union[DataFrame, Series]

Named Series or DataFrame to join to.

required
conditions Any

Variable argument of tuple(s) of the form (left_on, right_on, op), where left_on is the column label from df, right_on is the column label from right, while op is the operator. The col class is also supported. The operator can be any of ==, !=, <=, <, >=, >. For multiple conditions, the and(&) operator is used to combine the results of the individual conditions.

()
how Literal['inner', 'left', 'right', 'outer']

Indicates the type of join to be performed. It can be one of inner, left, right or outer.

'inner'
sort_by_appearance bool

If how = inner and sort_by_appearance = False, there is no guarantee that the original order is preserved. Usually, this offers more performance. If how = left, the row order from the left dataframe is preserved; if how = right, the row order from the right dataframe is preserved.

Deprecated in 0.25.0

False
df_columns Optional[Any]

Columns to select from df in the final output dataframe. Column selection is based on the select syntax.

slice(None)
right_columns Optional[Any]

Columns to select from right in the final output dataframe. Column selection is based on the select syntax.

slice(None)
use_numba bool

Use numba, if installed, to accelerate the computation.

False
keep Literal['first', 'last', 'all']

Choose whether to return the first match, last match or all matches.

'all'
indicator Optional[Union[bool, str]]

If True, adds a column to the output DataFrame called _merge with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of left_only for observations whose merge key only appears in the left DataFrame, right_only for observations whose merge key only appears in the right DataFrame, and both if the observation’s merge key is found in both DataFrames.

False
force bool

If True, force the non-equi join conditions to execute before the equi join.

False

Returns:

Type Description
DataFrame

A pandas DataFrame of the two merged Pandas objects.

Source code in janitor/functions/conditional_join.py
@pf.register_dataframe_method
def conditional_join(
    df: pd.DataFrame,
    right: Union[pd.DataFrame, pd.Series],
    *conditions: Any,
    how: Literal["inner", "left", "right", "outer"] = "inner",
    sort_by_appearance: bool = False,
    df_columns: Optional[Any] = slice(None),
    right_columns: Optional[Any] = slice(None),
    keep: Literal["first", "last", "all"] = "all",
    use_numba: bool = False,
    indicator: Optional[Union[bool, str]] = False,
    force: bool = False,
) -> pd.DataFrame:
    """The conditional_join function operates similarly to `pd.merge`,
    but supports joins on inequality operators,
    or a combination of equi and non-equi joins.

    Joins solely on equality are not supported.

    If the join is solely on equality, `pd.merge` function
    covers that; if you are interested in nearest joins, asof joins,
    or rolling joins, then `pd.merge_asof` covers that.
    There is also pandas' IntervalIndex, which is efficient for range joins,
    especially if the intervals do not overlap.

    Column selection in `df_columns` and `right_columns` is possible using the
    [`select`][janitor.functions.select.select] syntax.

    Performance might be improved by setting `use_numba` to `True`.
    This assumes that `numba` is installed.

    This function returns rows, if any, where values from `df` meet the
    condition(s) for values from `right`. The conditions are passed in
    as a variable argument of tuples, where the tuple is of
    the form `(left_on, right_on, op)`; `left_on` is the column
    label from `df`, `right_on` is the column label from `right`,
    while `op` is the operator.

    The `col` class is also supported in the `conditional_join` syntax.

    For multiple conditions, the and(`&`)
    operator is used to combine the results of the individual conditions.

    In some scenarios there might be performance gains if the less than join,
    or the greater than join condition, or the range condition
    is executed before the equi join - pass `force=True` to force this.

    The operator can be any of `==`, `!=`, `<=`, `<`, `>=`, `>`.

    The join is done only on the columns.

    For non-equi joins, only numeric, timedelta and date columns are supported.

    `inner`, `left`, `right` and `outer` joins are supported.

    If the columns from `df` and `right` have nothing in common,
    a single index column is returned; else, a MultiIndex column
    is returned.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})
        >>> df2 = pd.DataFrame({"value_2A": [0, 3, 7, 12, 0, 2, 3, 1],
        ...                     "value_2B": [1, 5, 9, 15, 1, 4, 6, 3],
        ...                    })
        >>> df1
           value_1
        0        2
        1        5
        2        7
        3        1
        4        3
        5        4
        >>> df2
           value_2A  value_2B
        0         0         1
        1         3         5
        2         7         9
        3        12        15
        4         0         1
        5         2         4
        6         3         6
        7         1         3

        >>> df1.conditional_join(
        ...     df2,
        ...     ("value_1", "value_2A", ">"),
        ...     ("value_1", "value_2B", "<")
        ... )
           value_1  value_2A  value_2B
        0        2         1         3
        1        5         3         6
        2        3         2         4
        3        4         3         5
        4        4         3         6

        Use the `col` class:
        >>> df1.conditional_join(
        ...     df2,
        ...     col("value_1") > col("value_2A"),
        ...     col("value_1") < col("value_2B")
        ... )
           value_1  value_2A  value_2B
        0        2         1         3
        1        5         3         6
        2        3         2         4
        3        4         3         5
        4        4         3         6

        Select specific columns, after the join:
        >>> df1.conditional_join(
        ...     df2,
        ...     col("value_1") > col("value_2A"),
        ...     col("value_1") < col("value_2B"),
        ...     right_columns='value_2B',
        ...     how='left'
        ... )
           value_1  value_2B
        0        2       3.0
        1        5       6.0
        2        7       NaN
        3        1       NaN
        4        3       4.0
        5        4       5.0
        6        4       6.0

        Rename columns, before the join:
        >>> (df1
        ...  .rename(columns={'value_1':'left_column'})
        ...  .conditional_join(
        ...      df2,
        ...      ("left_column", "value_2A", ">"),
        ...      ("left_column", "value_2B", "<"),
        ...      right_columns='value_2B',
        ...      how='outer')
        ... )
            left_column  value_2B
        0           7.0       NaN
        1           1.0       NaN
        2           2.0       3.0
        3           5.0       6.0
        4           3.0       4.0
        5           4.0       5.0
        6           4.0       6.0
        7           NaN       1.0
        8           NaN       9.0
        9           NaN      15.0
        10          NaN       1.0

        Get the first match:
        >>> df1.conditional_join(
        ...     df2,
        ...     col("value_1") > col("value_2A"),
        ...     col("value_1") < col("value_2B"),
        ...     keep='first'
        ... )
           value_1  value_2A  value_2B
        0        2         1         3
        1        5         3         6
        2        3         2         4
        3        4         3         5

        Get the last match:
        >>> df1.conditional_join(
        ...     df2,
        ...     col("value_1") > col("value_2A"),
        ...     col("value_1") < col("value_2B"),
        ...     keep='last'
        ... )
           value_1  value_2A  value_2B
        0        2         1         3
        1        5         3         6
        2        3         2         4
        3        4         3         6

        Add an indicator column:
        >>> df1.conditional_join(
        ...     df2,
        ...     ("value_1", "value_2A", ">"),
        ...     ("value_1", "value_2B", "<"),
        ...     how='outer',
        ...     indicator=True
        ... )
            value_1      _merge  value_2A  value_2B
        0       7.0   left_only       NaN       NaN
        1       1.0   left_only       NaN       NaN
        2       2.0        both       1.0       3.0
        3       5.0        both       3.0       6.0
        4       3.0        both       2.0       4.0
        5       4.0        both       3.0       5.0
        6       4.0        both       3.0       6.0
        7       NaN  right_only       0.0       1.0
        8       NaN  right_only       7.0       9.0
        9       NaN  right_only      12.0      15.0
        10      NaN  right_only       0.0       1.0

    !!! abstract "Version Changed"

        - 0.24.0
            - Added `df_columns`, `right_columns`, `keep` and `use_numba` parameters.
        - 0.24.1
            - Added `indicator` parameter.
        - 0.25.0
            - `col` class supported.
            - Outer join supported. `sort_by_appearance` deprecated.
            - Numba support for equi join
        - 0.27.0
            - Added support for timedelta dtype.

    Args:
        df: A pandas DataFrame.
        right: Named Series or DataFrame to join to.
        conditions: Variable argument of tuple(s) of the form
            `(left_on, right_on, op)`, where `left_on` is the column
            label from `df`, `right_on` is the column label from `right`,
            while `op` is the operator.
            The `col` class is also supported. The operator can be any of
            `==`, `!=`, `<=`, `<`, `>=`, `>`. For multiple conditions,
            the and(`&`) operator is used to combine the results
            of the individual conditions.
        how: Indicates the type of join to be performed.
            It can be one of `inner`, `left`, `right` or `outer`.
        sort_by_appearance: If `how = inner` and
            `sort_by_appearance = False`, there
            is no guarantee that the original order is preserved.
            Usually, this offers more performance.
            If `how = left`, the row order from the left dataframe
            is preserved; if `how = right`, the row order
            from the right dataframe is preserved.
            !!!warning "Deprecated in 0.25.0"
        df_columns: Columns to select from `df` in the final output dataframe.
            Column selection is based on the
            [`select`][janitor.functions.select.select] syntax.
        right_columns: Columns to select from `right` in the final output dataframe.
            Column selection is based on the
            [`select`][janitor.functions.select.select] syntax.
        use_numba: Use numba, if installed, to accelerate the computation.
        keep: Choose whether to return the first match, last match or all matches.
        indicator: If `True`, adds a column to the output DataFrame
            called `_merge` with information on the source of each row.
            The column can be given a different name by providing a string argument.
            The column will have a Categorical type with the value of `left_only`
            for observations whose merge key only appears in the left DataFrame,
            `right_only` for observations whose merge key
            only appears in the right DataFrame, and `both` if the observation’s
            merge key is found in both DataFrames.
        force: If `True`, force the non-equi join conditions to execute before the equi join.


    Returns:
        A pandas DataFrame of the two merged Pandas objects.
    """  # noqa: E501

    return _conditional_join_compute(
        df,
        right,
        conditions,
        how,
        sort_by_appearance,
        df_columns,
        right_columns,
        keep,
        use_numba,
        indicator,
        force,
    )
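
The examples above use only non-equi conditions. Below is a minimal sketch that combines an equality condition with a range condition; the frames and column names (sales, bands, item, quantity, start, end) are hypothetical and not part of the library.

import pandas as pd
import janitor  # registers the conditional_join accessor

# Hypothetical frames: match each sale to the band whose range
# contains the sale quantity, per product.
sales = pd.DataFrame({"product": ["A", "B"], "quantity": [5, 12]})
bands = pd.DataFrame(
    {"item": ["A", "A", "B"], "start": [0, 10, 0], "end": [10, 20, 20]}
)

out = sales.conditional_join(
    bands,
    ("product", "item", "=="),    # equi condition
    ("quantity", "start", ">="),  # non-equi conditions
    ("quantity", "end", "<"),
    how="inner",
)
# Passing force=True would run the range conditions before the equi
# condition, which can help performance in some scenarios.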

get_join_indices(df, right, conditions, keep='all', use_numba=False, force=False)

Convenience function to return the matching indices from an inner join.

New in version 0.27.0

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
right Union[DataFrame, Series]

Named Series or DataFrame to join to.

required
conditions list[tuple[str]]

List of arguments of tuple(s) of the form (left_on, right_on, op), where left_on is the column label from df, right_on is the column label from right, while op is the operator. The col class is also supported. The operator can be any of ==, !=, <=, <, >=, >. For multiple conditions, the and(&) operator is used to combine the results of the individual conditions.

required
use_numba bool

Use numba, if installed, to accelerate the computation.

False
keep Literal['first', 'last', 'all']

Choose whether to return the first match, last match or all matches.

'all'
force bool

If True, force the non-equi join conditions to execute before the equi join.

False

Returns:

Type Description
tuple[ndarray, ndarray]

A tuple of indices for the rows in the dataframes that match.

Source code in janitor/functions/conditional_join.py
def get_join_indices(
    df: pd.DataFrame,
    right: Union[pd.DataFrame, pd.Series],
    conditions: list[tuple[str]],
    keep: Literal["first", "last", "all"] = "all",
    use_numba: bool = False,
    force: bool = False,
) -> tuple[np.ndarray, np.ndarray]:
    """Convenience function to return the matching indices from an inner join.

    !!! info "New in version 0.27.0"

    Args:
        df: A pandas DataFrame.
        right: Named Series or DataFrame to join to.
        conditions: List of arguments of tuple(s) of the form
            `(left_on, right_on, op)`, where `left_on` is the column
            label from `df`, `right_on` is the column label from `right`,
            while `op` is the operator.
            The `col` class is also supported. The operator can be any of
            `==`, `!=`, `<=`, `<`, `>=`, `>`. For multiple conditions,
            the and(`&`) operator is used to combine the results
            of the individual conditions.
        use_numba: Use numba, if installed, to accelerate the computation.
        keep: Choose whether to return the first match, last match or all matches.
        force: If `True`, force the non-equi join conditions
            to execute before the equi join.

    Returns:
        A tuple of indices for the rows in the dataframes that match.
    """
    return _conditional_join_compute(
        df=df,
        right=right,
        conditions=conditions,
        how="inner",
        sort_by_appearance=False,
        df_columns=None,
        right_columns=None,
        keep=keep,
        use_numba=use_numba,
        indicator=False,
        force=force,
        return_matching_indices=True,
    )
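
As a rough usage sketch, reusing df1 and df2 from the conditional_join examples above (and assuming the function is exported at the package top level, per the top-level API convention):

from janitor import get_join_indices

left_idx, right_idx = get_join_indices(
    df1,
    df2,
    [("value_1", "value_2A", ">"), ("value_1", "value_2B", "<")],
)
# left_idx and right_idx are aligned arrays: row left_idx[i] of df1
# matches row right_idx[i] of df2 under the inner-join conditions.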

convert_date

convert_excel_date(df, column_name)

Convert Excel's serial date format into Python datetime format.

This method mutates the original DataFrame.

Implementation is also from Stack Overflow.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [39690, 39690, 37118]})
>>> df
    date
0  39690
1  39690
2  37118
>>> df.convert_excel_date('date')
        date
0 2008-08-30
1 2008-08-30
2 2001-08-15

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Raises:

Type Description
ValueError

If there are non numeric values in the column.

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def convert_excel_date(
    df: pd.DataFrame, column_name: Hashable
) -> pd.DataFrame:
    """Convert Excel's serial date format into Python datetime format.

    This method mutates the original DataFrame.

    Implementation is also from
    [Stack Overflow](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas).

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"date": [39690, 39690, 37118]})
        >>> df
            date
        0  39690
        1  39690
        2  37118
        >>> df.convert_excel_date('date')
                date
        0 2008-08-30
        1 2008-08-30
        2 2001-08-15

    Args:
        df: A pandas DataFrame.
        column_name: A column name.

    Raises:
        ValueError: If there are non numeric values in the column.

    Returns:
        A pandas DataFrame with corrected dates.
    """  # noqa: E501

    if not is_numeric_dtype(df[column_name]):
        raise ValueError(
            "There are non-numeric values in the column. "
            "All values must be numeric."
        )

    df[column_name] = pd.TimedeltaIndex(
        df[column_name], unit="d"
    ) + dt.datetime(
        1899, 12, 30
    )  # noqa: W503
    return df
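
For intuition, the conversion above is plain date arithmetic against Excel's epoch of 1899-12-30; a quick check against the documented example:

import datetime as dt

# 39690 serial days after Excel's epoch lands on 2008-08-30,
# matching the doctest output above.
assert dt.datetime(1899, 12, 30) + dt.timedelta(days=39690) == dt.datetime(2008, 8, 30)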

convert_matlab_date(df, column_name)

Convert Matlab's serial date number into Python datetime format.

Implementation is also from Stack Overflow.

This method mutates the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
>>> df
            date
0  737125.000000
1  737124.815863
2  737124.498500
3  737124.000000
>>> df.convert_matlab_date('date')
                        date
0 2018-03-06 00:00:00.000000
1 2018-03-05 19:34:50.563200
2 2018-03-05 11:57:50.399999
3 2018-03-05 00:00:00.000000

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def convert_matlab_date(
    df: pd.DataFrame, column_name: Hashable
) -> pd.DataFrame:
    """Convert Matlab's serial date number into Python datetime format.

    Implementation is also from
    [Stack Overflow](https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python).

    This method mutates the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
        >>> df
                    date
        0  737125.000000
        1  737124.815863
        2  737124.498500
        3  737124.000000
        >>> df.convert_matlab_date('date')
                                date
        0 2018-03-06 00:00:00.000000
        1 2018-03-05 19:34:50.563200
        2 2018-03-05 11:57:50.399999
        3 2018-03-05 00:00:00.000000

    Args:
        df: A pandas DataFrame.
        column_name: A column name.

    Returns:
        A pandas DataFrame with corrected dates.
    """  # noqa: E501
    days = pd.Series([dt.timedelta(v % 1) for v in df[column_name]])
    df[column_name] = (
        df[column_name].astype(int).apply(dt.datetime.fromordinal)
        + days
        - dt.timedelta(days=366)
    )
    return df
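
The 366-day offset accounts for MATLAB datenums counting days from year 0 while Python ordinals count from year 1. A small check against the documented example:

import datetime as dt

datenum = 737125.0
py_dt = (
    dt.datetime.fromordinal(int(datenum))   # integer part: whole days
    + dt.timedelta(days=datenum % 1)        # fractional part: time of day
    - dt.timedelta(days=366)                # MATLAB year-0 vs Python year-1 offset
)
assert py_dt == dt.datetime(2018, 3, 6)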

convert_unix_date(df, column_name)

Convert unix epoch time into Python datetime format.

Note that this ignores the local timezone and converts all timestamps to naive datetimes based on UTC!

This method mutates the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
>>> df
         date
0  1651510462
1    53394822
2  1126233195
>>> df.convert_unix_date('date')
                 date
0 2022-05-02 16:54:22
1 1971-09-10 23:53:42
2 2005-09-09 02:33:15

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def convert_unix_date(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:
    """Convert unix epoch time into Python datetime format.

    Note that this ignores the local timezone and converts all timestamps
    to naive datetimes based on UTC!

    This method mutates the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
        >>> df
                 date
        0  1651510462
        1    53394822
        2  1126233195
        >>> df.convert_unix_date('date')
                         date
        0 2022-05-02 16:54:22
        1 1971-09-10 23:53:42
        2 2005-09-09 02:33:15

    Args:
        df: A pandas DataFrame.
        column_name: A column name.

    Returns:
        A pandas DataFrame with corrected dates.
    """

    try:
        df[column_name] = pd.to_datetime(df[column_name], unit="s")
    except OutOfBoundsDatetime:  # Indicates time is in milliseconds.
        df[column_name] = pd.to_datetime(df[column_name], unit="ms")
    return df
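
The try/except above exists because very large values cannot be interpreted as seconds; they overflow the datetime range and are re-parsed as milliseconds. A small illustration with pandas directly:

import pandas as pd

pd.to_datetime(1651510462, unit="s")      # Timestamp('2022-05-02 16:54:22')
pd.to_datetime(1651510462000, unit="ms")  # same instant, expressed in milliseconds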

count_cumulative_unique

Implementation of count_cumulative_unique.

count_cumulative_unique(df, column_name, dest_column_name, case_sensitive=True)

Generates a running total of cumulative unique values in a given column.

A new column will be created containing a running count of unique values in the specified column. If case_sensitive is True, then the case of any letters will matter (i.e., a != A); otherwise, the case of any letters will not matter.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "letters": list("aabABb"),
...     "numbers": range(4, 10),
... })
>>> df
  letters  numbers
0       a        4
1       a        5
2       b        6
3       A        7
4       B        8
5       b        9
>>> df.count_cumulative_unique(
...     column_name="letters",
...     dest_column_name="letters_unique_count",
... )
  letters  numbers  letters_unique_count
0       a        4                     1
1       a        5                     1
2       b        6                     2
3       A        7                     3
4       B        8                     4
5       b        9                     4

Cumulative counts, ignoring casing.

>>> df.count_cumulative_unique(
...     column_name="letters",
...     dest_column_name="letters_unique_count",
...     case_sensitive=False,
... )
  letters  numbers  letters_unique_count
0       a        4                     1
1       a        5                     1
2       b        6                     2
3       A        7                     2
4       B        8                     2
5       b        9                     2

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Name of the column containing values from which a running count of unique values will be created.

required
dest_column_name str

The name of the new column containing the cumulative count of unique values that will be created.

required
case_sensitive bool

Whether the count should be case sensitive. If False, uppercase and lowercase letters are treated as equal. Only valid with string-like columns.

True

Raises:

Type Description
TypeError

If case_sensitive is False when counting a non-string column_name.

Returns:

Type Description
DataFrame

A pandas DataFrame with a new column containing a cumulative count of unique values from another column.

Source code in janitor/functions/count_cumulative_unique.py
@pf.register_dataframe_method
def count_cumulative_unique(
    df: pd.DataFrame,
    column_name: Hashable,
    dest_column_name: str,
    case_sensitive: bool = True,
) -> pd.DataFrame:
    """Generates a running total of cumulative unique values in a given column.

    A new column will be created containing a running
    count of unique values in the specified column.
    If `case_sensitive` is `True`, then the case of
    any letters will matter (i.e., `a != A`);
    otherwise, the case of any letters will not matter.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "letters": list("aabABb"),
        ...     "numbers": range(4, 10),
        ... })
        >>> df
          letters  numbers
        0       a        4
        1       a        5
        2       b        6
        3       A        7
        4       B        8
        5       b        9
        >>> df.count_cumulative_unique(
        ...     column_name="letters",
        ...     dest_column_name="letters_unique_count",
        ... )
          letters  numbers  letters_unique_count
        0       a        4                     1
        1       a        5                     1
        2       b        6                     2
        3       A        7                     3
        4       B        8                     4
        5       b        9                     4

        Cumulative counts, ignoring casing.

        >>> df.count_cumulative_unique(
        ...     column_name="letters",
        ...     dest_column_name="letters_unique_count",
        ...     case_sensitive=False,
        ... )
          letters  numbers  letters_unique_count
        0       a        4                     1
        1       a        5                     1
        2       b        6                     2
        3       A        7                     2
        4       B        8                     2
        5       b        9                     2

    Args:
        df: A pandas DataFrame.
        column_name: Name of the column containing values from which a
            running count of unique values will be created.
        dest_column_name: The name of the new column containing the
            cumulative count of unique values that will be created.
        case_sensitive: Whether the count should be case sensitive.
            If `False`, uppercase and lowercase letters are treated as equal.
            Only valid with string-like columns.

    Raises:
        TypeError: If `case_sensitive` is False when counting a non-string
            `column_name`.

    Returns:
        A pandas DataFrame with a new column containing a cumulative
            count of unique values from another column.
    """
    check_column(df, column_name)
    check_column(df, dest_column_name, present=False)

    counter = df[column_name]
    if not case_sensitive:
        try:
            # Make it so that the same uppercase and lowercase
            # letter are treated as one unique value
            counter = counter.str.lower()
        except (AttributeError, TypeError) as e:
            # AttributeError is raised by pandas when .str is used on
            # non-string types, e.g. int.
            # TypeError is raised by pandas when .str.lower is used on a
            # forbidden string type, e.g. bytes.
            raise TypeError(
                "case_sensitive=False can only be used with a string-like "
                f"type. Column {column_name} is {counter.dtype} type."
            ) from e

    counter = (
        counter.groupby(counter, sort=False).cumcount().to_numpy(copy=False)
    )
    counter = np.cumsum(counter == 0)

    return df.assign(**{dest_column_name: counter})
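
The core of the implementation is a groupby trick: mark the first occurrence of each value (cumcount() == 0) and take a cumulative sum of those markers. A standalone sketch of the same idea:

import numpy as np
import pandas as pd

s = pd.Series(list("aabABb"))
first_seen = s.groupby(s, sort=False).cumcount().to_numpy() == 0
running_unique = np.cumsum(first_seen)  # array([1, 1, 2, 3, 4, 4])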

currency_column_to_numeric

currency_column_to_numeric(df, column_name, cleaning_style=None, cast_non_numeric=None, fill_all_non_numeric=None, remove_non_numeric=False)

Convert currency column to numeric.

This method does not mutate the original DataFrame.

This method allows one to take a column containing currency values, inadvertently imported as a string, and cast it as a float. This is usually the case when reading CSV files that were modified in Excel. Empty strings (i.e. '') are retained as NaN values.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a_col": [" 24.56", "-", "(12.12)", "1,000,000"],
...     "d_col": ["", "foo", "1.23 dollars", "-1,000 yen"],
... })
>>> df
       a_col         d_col
0      24.56
1          -           foo
2    (12.12)  1.23 dollars
3  1,000,000    -1,000 yen

The default cleaning style.

>>> df.currency_column_to_numeric("d_col")
       a_col    d_col
0      24.56      NaN
1          -      NaN
2    (12.12)     1.23
3  1,000,000 -1000.00

The accounting cleaning style.

>>> df.currency_column_to_numeric("a_col", cleaning_style="accounting")
        a_col         d_col
0       24.56
1        0.00           foo
2      -12.12  1.23 dollars
3  1000000.00    -1,000 yen

Valid cleaning styles are:

  • None: Default cleaning is applied. Empty strings are always retained as NaN. Numbers, -, . are extracted and the resulting string is cast to a float.
  • 'accounting': Replaces numbers in parentheses with negatives, removes commas.

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame.

required
column_name str

The column containing currency values to modify.

required
cleaning_style Optional[str]

What style of cleaning to perform.

None
cast_non_numeric Optional[dict]

A dict of how to coerce certain strings to numeric type. For example, if there are values of 'REORDER' in the DataFrame, {'REORDER': 0} will cast all instances of 'REORDER' to 0. Only takes effect in the default cleaning style.

None
fill_all_non_numeric Optional[Union[float, int]]

Similar to cast_non_numeric, but fills all strings to the same value. For example, fill_all_non_numeric=1, will make everything that doesn't coerce to a currency 1. Only takes effect in the default cleaning style.

None
remove_non_numeric bool

If set to True, rows of df that contain non-numeric values in the column_name column will be removed. Only takes effect in the default cleaning style.

False

Raises:

Type Description
ValueError

If cleaning_style is not one of the accepted styles.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/currency_column_to_numeric.py
@pf.register_dataframe_method
@deprecated_alias(col_name="column_name", type="cleaning_style")
def currency_column_to_numeric(
    df: pd.DataFrame,
    column_name: str,
    cleaning_style: Optional[str] = None,
    cast_non_numeric: Optional[dict] = None,
    fill_all_non_numeric: Optional[Union[float, int]] = None,
    remove_non_numeric: bool = False,
) -> pd.DataFrame:
    """Convert currency column to numeric.

    This method does not mutate the original DataFrame.

    This method allows one to take a column containing currency values,
    inadvertently imported as a string, and cast it as a float. This is
    usually the case when reading CSV files that were modified in Excel.
    Empty strings (i.e. `''`) are retained as `NaN` values.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a_col": [" 24.56", "-", "(12.12)", "1,000,000"],
        ...     "d_col": ["", "foo", "1.23 dollars", "-1,000 yen"],
        ... })
        >>> df  # doctest: +NORMALIZE_WHITESPACE
               a_col         d_col
        0      24.56
        1          -           foo
        2    (12.12)  1.23 dollars
        3  1,000,000    -1,000 yen

        The default cleaning style.

        >>> df.currency_column_to_numeric("d_col")
               a_col    d_col
        0      24.56      NaN
        1          -      NaN
        2    (12.12)     1.23
        3  1,000,000 -1000.00

        The accounting cleaning style.

        >>> df.currency_column_to_numeric("a_col", cleaning_style="accounting")  # doctest: +NORMALIZE_WHITESPACE
                a_col         d_col
        0       24.56
        1        0.00           foo
        2      -12.12  1.23 dollars
        3  1000000.00    -1,000 yen

    Valid cleaning styles are:

    - `None`: Default cleaning is applied. Empty strings are always retained as
        `NaN`. Numbers, `-`, `.` are extracted and the resulting string
        is cast to a float.
    - `'accounting'`: Replaces numbers in parentheses with negatives, removes commas.

    Args:
        df: The pandas DataFrame.
        column_name: The column containing currency values to modify.
        cleaning_style: What style of cleaning to perform.
        cast_non_numeric: A dict of how to coerce certain strings to numeric
            type. For example, if there are values of 'REORDER' in the DataFrame,
            `{'REORDER': 0}` will cast all instances of 'REORDER' to 0.
            Only takes effect in the default cleaning style.
        fill_all_non_numeric: Similar to `cast_non_numeric`, but fills all
            strings to the same value. For example, `fill_all_non_numeric=1`, will
            make everything that doesn't coerce to a currency `1`.
            Only takes effect in the default cleaning style.
        remove_non_numeric: If set to True, rows of `df` that contain
            non-numeric values in the `column_name` column will be removed.
            Only takes effect in the default cleaning style.

    Raises:
        ValueError: If `cleaning_style` is not one of the accepted styles.

    Returns:
        A pandas DataFrame.
    """  # noqa: E501

    check("column_name", column_name, [str])
    check_column(df, column_name)

    column_series = df[column_name]
    if cleaning_style == "accounting":
        outcome = (
            df[column_name]
            .str.strip()
            .str.replace(",", "", regex=False)
            .str.replace(")", "", regex=False)
            .str.replace("(", "-", regex=False)
            .replace({"-": 0.0})
            .astype(float)
        )
        return df.assign(**{column_name: outcome})
    if cleaning_style is not None:
        raise ValueError(
            "`cleaning_style` is expected to be one of ('accounting', None). "
            f"Got {cleaning_style!r} instead."
        )

    if cast_non_numeric:
        check("cast_non_numeric", cast_non_numeric, [dict])

    _make_cc_patrial = partial(
        _currency_column_to_numeric,
        cast_non_numeric=cast_non_numeric,
    )
    column_series = column_series.apply(_make_cc_patrial)

    if remove_non_numeric:
        df = df.loc[column_series != "", :]

    # _replace_empty_string_with_none is applied here after the check on
    # remove_non_numeric since "" is our indicator that a string was coerced
    # in the original column
    column_series = _replace_empty_string_with_none(column_series)

    if fill_all_non_numeric is not None:
        check("fill_all_non_numeric", fill_all_non_numeric, [int, float])
        column_series = column_series.fillna(fill_all_non_numeric)

    column_series = _replace_original_empty_string_with_none(column_series)

    df = df.assign(**{column_name: pd.to_numeric(column_series)})

    return df
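
A minimal sketch of the default cleaning style with cast_non_numeric; the column name and the 'REORDER' sentinel are hypothetical:

import pandas as pd
import janitor

df = pd.DataFrame({"amount": ["1,000", "REORDER", ""]})

# '1,000' has its digits extracted and is cast to 1000.0, the sentinel
# 'REORDER' is coerced to 0, and the empty string is retained as NaN.
df.currency_column_to_numeric("amount", cast_non_numeric={"REORDER": 0})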

deconcatenate_column

Implementation of deconcatenating columns.

deconcatenate_column(df, column_name, sep=None, new_column_names=None, autoname=None, preserve_position=False)

De-concatenates a single column into multiple columns.

The column to de-concatenate can be either a collection (list, tuple, ...) which can be separated out with pd.Series.tolist(), or a string to slice based on sep.

To determine this behaviour automatically, the first element in the column specified is inspected.

If it is a string, then sep must be specified. Else, the function assumes that it is an iterable type (e.g. list or tuple), and will attempt to deconcatenate by splitting the list.

Given a column with string values, this is the inverse of the concatenate_columns function.

Used to quickly split columns out of a single column.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]})
>>> df
     m
0  1-x
1  2-y
2  3-z
>>> df.deconcatenate_column("m", sep="-", autoname="col")
     m col1 col2
0  1-x    1    x
1  2-y    2    y
2  3-z    3    z

The keyword argument preserve_position takes True or False boolean that controls whether the new_column_names will take the original position of the to-be-deconcatenated column_name:

  • When preserve_position=False (default), df.columns changes from [..., column_name, ...] to [..., column_name, ..., new_column_names]. In other words, the deconcatenated new columns are appended to the right of the original dataframe and the original column_name is NOT dropped.
  • When preserve_position=True, df.columns changes from [..., column_name, ...] to [..., new_column_names, ...]. In other words, the deconcatenated new columns REPLACE the original column_name at its original position, and column_name itself is dropped.

The keyword argument autoname accepts a base string and then automatically creates numbered column names based off the base string. For example, if col is passed in as the argument to autoname, and 4 columns are created, then the resulting columns will be named col1, col2, col3, col4. Numbering is always 1-indexed, not 0-indexed, in order to make the column names human-friendly.

This method does not mutate the original DataFrame.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to split.

required
sep Optional[str]

The separator delimiting the column's data.

None
new_column_names Optional[Union[List[str], Tuple[str]]]

A list of new column names post-splitting.

None
autoname str

A base name for automatically naming the new columns. Takes precedence over new_column_names if both are provided.

None
preserve_position bool

Boolean for whether or not to preserve original position of the column upon de-concatenation.

False

Raises:

Type Description
ValueError

If column_name is not present in the DataFrame.

ValueError

If sep is not provided and the column values are of type str.

ValueError

If either new_column_names or autoname is not supplied.

JanitorError

If incorrect number of names is provided within new_column_names.

Returns:

Type Description
DataFrame

A pandas DataFrame with a deconcatenated column.

Source code in janitor/functions/deconcatenate_column.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def deconcatenate_column(
    df: pd.DataFrame,
    column_name: Hashable,
    sep: Optional[str] = None,
    new_column_names: Optional[Union[List[str], Tuple[str]]] = None,
    autoname: str = None,
    preserve_position: bool = False,
) -> pd.DataFrame:
    """De-concatenates a single column into multiple columns.

    The column to de-concatenate can be either a collection (list, tuple, ...)
    which can be separated out with `pd.Series.tolist()`,
    or a string to slice based on `sep`.

    To determine this behaviour automatically,
    the first element in the column specified is inspected.

    If it is a string, then `sep` must be specified.
    Else, the function assumes that it is an iterable type
    (e.g. `list` or `tuple`),
    and will attempt to deconcatenate by splitting the list.

    Given a column with string values, this is the inverse of the
    [`concatenate_columns`][janitor.functions.concatenate_columns.concatenate_columns]
    function.

    Used to quickly split columns out of a single column.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]})
        >>> df
             m
        0  1-x
        1  2-y
        2  3-z
        >>> df.deconcatenate_column("m", sep="-", autoname="col")
             m col1 col2
        0  1-x    1    x
        1  2-y    2    y
        2  3-z    3    z

    The keyword argument `preserve_position`
    takes `True` or `False` boolean
    that controls whether the `new_column_names`
    will take the original position
    of the to-be-deconcatenated `column_name`:

    - When `preserve_position=False` (default), `df.columns` changes from
      `[..., column_name, ...]` to `[..., column_name, ..., new_column_names]`.
      In other words, the deconcatenated new columns are appended to the right
      of the original dataframe and the original `column_name` is NOT dropped.
    - When `preserve_position=True`, `df.columns` changes from
      `[..., column_name, ...]` to `[..., new_column_names, ...]`.
      In other words, the deconcatenated new columns REPLACE the original
      `column_name` at its original position, and `column_name` itself
      is dropped.

    The keyword argument `autoname` accepts a base string
    and then automatically creates numbered column names
    based off the base string.
    For example, if `col` is passed in as the argument to `autoname`,
    and 4 columns are created, then the resulting columns will be named
    `col1, col2, col3, col4`.
    Numbering is always 1-indexed, not 0-indexed,
    in order to make the column names human-friendly.

    This method does not mutate the original DataFrame.

    Args:
        df: A pandas DataFrame.
        column_name: The column to split.
        sep: The separator delimiting the column's data.
        new_column_names: A list of new column names post-splitting.
        autoname: A base name for automatically naming the new columns.
            Takes precedence over `new_column_names` if both are provided.
        preserve_position: Boolean for whether or not to preserve original
            position of the column upon de-concatenation.

    Raises:
        ValueError: If `column_name` is not present in the DataFrame.
        ValueError: If `sep` is not provided and the column values
            are of type `str`.
        ValueError: If either `new_column_names` or `autoname`
            is not supplied.
        JanitorError: If incorrect number of names is provided
            within `new_column_names`.

    Returns:
        A pandas DataFrame with a deconcatenated column.
    """  # noqa: E501

    if column_name not in df.columns:
        raise ValueError(f"column name {column_name} not present in DataFrame")

    if isinstance(df[column_name].iloc[0], str):
        if sep is None:
            raise ValueError(
                "`sep` must be specified if the column values "
                "are of type `str`."
            )
        df_deconcat = df[column_name].str.split(sep, expand=True)
    else:
        df_deconcat = pd.DataFrame(
            df[column_name].to_list(), columns=new_column_names, index=df.index
        )

    if new_column_names is None and autoname is None:
        raise ValueError(
            "One of `new_column_names` or `autoname` must be supplied."
        )

    if autoname:
        new_column_names = [
            f"{autoname}{i}" for i in range(1, df_deconcat.shape[1] + 1)
        ]

    if not len(new_column_names) == df_deconcat.shape[1]:
        raise JanitorError(
            f"You need to provide {len(df_deconcat.shape[1])} names "
            "to `new_column_names`"
        )

    df_deconcat.columns = new_column_names
    df_new = pd.concat([df, df_deconcat], axis=1)

    if preserve_position:
        df_original = df.copy()
        cols = list(df_original.columns)
        index_original = cols.index(column_name)

        for i, col_new in enumerate(new_column_names):
            cols.insert(index_original + i, col_new)

        df_new = df_new.select(cols, axis="columns").drop(columns=column_name)

    return df_new
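
A short sketch with explicit new_column_names and preserve_position=True; the frame and the names num/letter are hypothetical:

import pandas as pd
import janitor

df = pd.DataFrame({"m": ["1-x", "2-y"], "n": [10, 20]})

out = df.deconcatenate_column(
    "m",
    sep="-",
    new_column_names=["num", "letter"],
    preserve_position=True,
)
# out.columns is ['num', 'letter', 'n']: the new columns replace 'm'
# at its original position and 'm' itself is dropped.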

drop_constant_columns

Implementation of drop_constant_columns.

drop_constant_columns(df)

Finds and drops the constant columns from a Pandas DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> data_dict = {
...     "a": [1, 1, 1],
...     "b": [1, 2, 3],
...     "c": [1, 1, 1],
...     "d": ["rabbit", "leopard", "lion"],
...     "e": ["Cambridge", "Shanghai", "Basel"]
... }
>>> df = pd.DataFrame(data_dict)
>>> df
   a  b  c        d          e
0  1  1  1   rabbit  Cambridge
1  1  2  1  leopard   Shanghai
2  1  3  1     lion      Basel
>>> df.drop_constant_columns()
   b        d          e
0  1   rabbit  Cambridge
1  2  leopard   Shanghai
2  3     lion      Basel

Parameters:

Name Type Description Default
df DataFrame

Input Pandas DataFrame

required

Returns:

Type Description
DataFrame

The Pandas DataFrame with the constant columns dropped.

Source code in janitor/functions/drop_constant_columns.py
@pf.register_dataframe_method
def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Finds and drops the constant columns from a Pandas DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> data_dict = {
        ...     "a": [1, 1, 1],
        ...     "b": [1, 2, 3],
        ...     "c": [1, 1, 1],
        ...     "d": ["rabbit", "leopard", "lion"],
        ...     "e": ["Cambridge", "Shanghai", "Basel"]
        ... }
        >>> df = pd.DataFrame(data_dict)
        >>> df
           a  b  c        d          e
        0  1  1  1   rabbit  Cambridge
        1  1  2  1  leopard   Shanghai
        2  1  3  1     lion      Basel
        >>> df.drop_constant_columns()
           b        d          e
        0  1   rabbit  Cambridge
        1  2  leopard   Shanghai
        2  3     lion      Basel

    Args:
        df: Input Pandas DataFrame

    Returns:
        The Pandas DataFrame with the constant columns dropped.
    """
    return df.loc[:, df.nunique().ne(1)]
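
Note that the implementation relies on nunique(), which ignores NaN by default; a column that is entirely NaN therefore has zero unique values and is kept. A small illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": [np.nan] * 3})
# 'a' (nunique == 1) is dropped; 'b' (nunique == 3) and the all-NaN
# column 'c' (nunique == 0) are both kept.
df.loc[:, df.nunique().ne(1)]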

drop_duplicate_columns

Implementation for drop_duplicate_columns.

drop_duplicate_columns(df, column_name, nth_index=0)

Remove a duplicated column specified by column_name.

Specifying nth_index=0 will remove the first column, nth_index=1 will remove the second column, and so on and so forth.

The corresponding operation in R's tidyverse is: select(-<column_name>_<nth_index + 1>)

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": range(2, 5),
...     "b": range(3, 6),
...     "A": range(4, 7),
...     "a*": range(6, 9),
... }).clean_names(remove_special=True)
>>> df
   a  b  a  a
0  2  3  4  6
1  3  4  5  7
2  4  5  6  8
>>> df.drop_duplicate_columns(column_name="a", nth_index=1)
   a  b  a
0  2  3  6
1  3  4  7
2  4  5  8

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame

required
column_name Hashable

Name of duplicated columns.

required
nth_index int

Among the duplicated columns, select the nth column to drop.

0

Returns:

Type Description
DataFrame

A pandas DataFrame

Source code in janitor/functions/drop_duplicate_columns.py
@pf.register_dataframe_method
def drop_duplicate_columns(
    df: pd.DataFrame, column_name: Hashable, nth_index: int = 0
) -> pd.DataFrame:
    """Remove a duplicated column specified by `column_name`.

    Specifying `nth_index=0` will remove the first column,
    `nth_index=1` will remove the second column,
    and so on and so forth.

    The corresponding operation in R's tidyverse is:
    `select(-<column_name>_<nth_index + 1>)`

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": range(2, 5),
        ...     "b": range(3, 6),
        ...     "A": range(4, 7),
        ...     "a*": range(6, 9),
        ... }).clean_names(remove_special=True)
        >>> df
           a  b  a  a
        0  2  3  4  6
        1  3  4  5  7
        2  4  5  6  8
        >>> df.drop_duplicate_columns(column_name="a", nth_index=1)
           a  b  a
        0  2  3  6
        1  3  4  7
        2  4  5  8

    Args:
        df: A pandas DataFrame
        column_name: Name of duplicated columns.
        nth_index: Among the duplicated columns,
            select the nth column to drop.

    Returns:
        A pandas DataFrame
    """
    col_indexes = [
        col_idx
        for col_idx, col_name in enumerate(df.columns)
        if col_name == column_name
    ]

    # Select the column to remove based on nth_index.
    removed_col_idx = col_indexes[nth_index]
    # Filter out columns except for the one to be removed.
    filtered_cols = [
        c_i for c_i, _ in enumerate(df.columns) if c_i != removed_col_idx
    ]

    return df.iloc[:, filtered_cols]

dropnotnull

Implementation source for dropnotnull.

dropnotnull(df, column_name)

Drop rows that do not have null values in the given column.

This method does not mutate the original DataFrame.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [1., np.NaN, 3.], "b": [None, "y", "z"]})
>>> df
     a     b
0  1.0  None
1  NaN     y
2  3.0     z
>>> df.dropnotnull("a")
    a  b
1 NaN  y
>>> df.dropnotnull("b")
     a     b
0  1.0  None

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column name to drop rows from.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with dropped rows.

Source code in janitor/functions/dropnotnull.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def dropnotnull(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:
    """Drop rows that do *not* have null values in the given column.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import numpy as np
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [1., np.NaN, 3.], "b": [None, "y", "z"]})
        >>> df
             a     b
        0  1.0  None
        1  NaN     y
        2  3.0     z
        >>> df.dropnotnull("a")
            a  b
        1 NaN  y
        >>> df.dropnotnull("b")
             a     b
        0  1.0  None

    Args:
        df: A pandas DataFrame.
        column_name: The column name to drop rows from.

    Returns:
        A pandas DataFrame with dropped rows.
    """
    return df[pd.isna(df[column_name])]

encode_categorical

encode_categorical(df, column_names=None, **kwargs)

Encode the specified columns with Pandas' category dtype.

It is syntactic sugar around pd.Categorical.

This method does not mutate the original DataFrame.

Simply pass a string, or a sequence of column names to column_names; alternatively, you can pass kwargs, where the keys are the column names and the values can either be None, sort, appearance or a 1-D array-like object.

  • None: column is cast to an unordered categorical.
  • sort: column is cast to an ordered categorical, with the order defined by the sort-order of the categories.
  • appearance: column is cast to an ordered categorical, with the order defined by the order of appearance in the original column.
  • 1d-array-like object: column is cast to an ordered categorical, with the categories and order as specified in the input array.

column_names and kwargs parameters cannot be used at the same time.

Examples:

Using column_names

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["b", "b", "a", "c", "b"],
...     "bar": range(4, 9),
... })
>>> df
  foo  bar
0   b    4
1   b    5
2   a    6
3   c    7
4   b    8
>>> df.dtypes
foo    object
bar     int64
dtype: object
>>> enc_df = df.encode_categorical(column_names="foo")
>>> enc_df.dtypes
foo    category
bar       int64
dtype: object
>>> enc_df["foo"].cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> enc_df["foo"].cat.ordered
False

Using kwargs to specify an ordered categorical.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["b", "b", "a", "c", "b"],
...     "bar": range(4, 9),
... })
>>> df.dtypes
foo    object
bar     int64
dtype: object
>>> enc_df = df.encode_categorical(foo="appearance")
>>> enc_df.dtypes
foo    category
bar       int64
dtype: object
>>> enc_df["foo"].cat.categories
Index(['b', 'a', 'c'], dtype='object')
>>> enc_df["foo"].cat.ordered
True

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame object.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names.

None
**kwargs Any

A mapping from column name to either None, 'sort' or 'appearance', or a 1-D array. This is useful in creating categorical columns that are ordered, or if the user needs to explicitly specify the categories.

{}

Raises:

Type Description
ValueError

If both column_names and kwargs are provided.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/encode_categorical.py
@pf.register_dataframe_method
@deprecated_alias(columns="column_names")
def encode_categorical(
    df: pd.DataFrame,
    column_names: Union[str, Iterable[str], Hashable] = None,
    **kwargs: Any,
) -> pd.DataFrame:
    """Encode the specified columns with Pandas' [category dtype][cat].

    [cat]: http://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

    It is syntactic sugar around `pd.Categorical`.

    This method does not mutate the original DataFrame.

    Simply pass a string, or a sequence of column names to `column_names`;
    alternatively, you can pass kwargs, where the keys are the column names
    and the values can either be None, `sort`, `appearance`
    or a 1-D array-like object.

    - None: column is cast to an unordered categorical.
    - `sort`: column is cast to an ordered categorical,
              with the order defined by the sort-order of the categories.
    - `appearance`: column is cast to an ordered categorical,
                    with the order defined by the order of appearance
                    in the original column.
    - 1d-array-like object: column is cast to an ordered categorical,
                            with the categories and order as specified
                            in the input array.

    `column_names` and `kwargs` parameters cannot be used at the same time.

    Examples:
        Using `column_names`

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "foo": ["b", "b", "a", "c", "b"],
        ...     "bar": range(4, 9),
        ... })
        >>> df
          foo  bar
        0   b    4
        1   b    5
        2   a    6
        3   c    7
        4   b    8
        >>> df.dtypes
        foo    object
        bar     int64
        dtype: object
        >>> enc_df = df.encode_categorical(column_names="foo")
        >>> enc_df.dtypes
        foo    category
        bar       int64
        dtype: object
        >>> enc_df["foo"].cat.categories
        Index(['a', 'b', 'c'], dtype='object')
        >>> enc_df["foo"].cat.ordered
        False

        Using `kwargs` to specify an ordered categorical.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "foo": ["b", "b", "a", "c", "b"],
        ...     "bar": range(4, 9),
        ... })
        >>> df.dtypes
        foo    object
        bar     int64
        dtype: object
        >>> enc_df = df.encode_categorical(foo="appearance")
        >>> enc_df.dtypes
        foo    category
        bar       int64
        dtype: object
        >>> enc_df["foo"].cat.categories
        Index(['b', 'a', 'c'], dtype='object')
        >>> enc_df["foo"].cat.ordered
        True

    Args:
        df: A pandas DataFrame object.
        column_names: A column name or an iterable (list or tuple)
            of column names.
        **kwargs: A mapping from column name to either `None`,
            `'sort'` or `'appearance'`, or a 1-D array. This is useful
            in creating categorical columns that are ordered, or
            if the user needs to explicitly specify the categories.

    Raises:
        ValueError: If both `column_names` and `kwargs` are provided.

    Returns:
        A pandas DataFrame.
    """  # noqa: E501

    if all((column_names, kwargs)):
        raise ValueError(
            "Only one of `column_names` or `kwargs` can be provided."
        )
    # column_names deal with only category dtype (unordered)
    # kwargs takes care of scenarios where user wants an ordered category
    # or user supplies specific categories to create the categorical
    if column_names is not None:
        column_names = get_index_labels([column_names], df, axis="columns")
        dtypes = {col: "category" for col in column_names}
        return df.astype(dtypes)

    return _computations_as_categorical(df, **kwargs)

expand_column

Implementation for expand_column.

expand_column(df, column_name, sep='|', concat=True)

Expand a categorical column with multiple labels into dummy-coded columns.

Super sugary syntax that wraps pandas.Series.str.get_dummies.

This method does not mutate the original DataFrame.

Examples:

Functional usage syntax:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
...         "col2": [1, 2, 3, 4],
...     }
... )
>>> df = expand_column(
...     df,
...     column_name="col1",
...     sep=", "  # note space in sep
... )
>>> df
      col1  col2  A  B  C  D  E  F
0     A, B     1  1  1  0  0  0  0
1  B, C, D     2  0  1  1  1  0  0
2     E, F     3  0  0  0  0  1  1
3  A, E, F     4  1  0  0  0  1  1

Method chaining syntax:

>>> import pandas as pd
>>> import janitor
>>> df = (
...     pd.DataFrame(
...         {
...             "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
...             "col2": [1, 2, 3, 4],
...         }
...     )
...     .expand_column(
...         column_name='col1',
...         sep=', '
...     )
... )
>>> df
      col1  col2  A  B  C  D  E  F
0     A, B     1  1  1  0  0  0  0
1  B, C, D     2  0  1  1  1  0  0
2     E, F     3  0  0  0  0  1  1
3  A, E, F     4  1  0  0  0  1  1

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Which column to expand.

required
sep str

The delimiter, same as pandas.Series.str.get_dummies's sep.

'|'
concat bool

Whether to return the expanded column concatenated to the original dataframe (concat=True), or to return it standalone (concat=False).

True

Returns:

Type Description
DataFrame

A pandas DataFrame with an expanded column.
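
A minimal sketch of the concat=False variant described above, which returns only the dummy-coded columns rather than joining them back onto the original DataFrame:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": ["A, B", "B, C, D"], "col2": [1, 2]})
>>> dummies = df.expand_column(column_name="col1", sep=", ", concat=False)
>>> dummies.columns.to_list()
['A', 'B', 'C', 'D']
>>> dummies.shape
(2, 4)
```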

Source code in janitor/functions/expand_column.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def expand_column(
    df: pd.DataFrame,
    column_name: Hashable,
    sep: str = "|",
    concat: bool = True,
) -> pd.DataFrame:
    """Expand a categorical column with multiple labels into dummy-coded columns.

    Super sugary syntax that wraps `pandas.Series.str.get_dummies`.

    This method does not mutate the original DataFrame.

    Examples:
        Functional usage syntax:

        >>> import pandas as pd
        >>> df = pd.DataFrame(
        ...     {
        ...         "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
        ...         "col2": [1, 2, 3, 4],
        ...     }
        ... )
        >>> df = expand_column(
        ...     df,
        ...     column_name="col1",
        ...     sep=", "  # note space in sep
        ... )
        >>> df
              col1  col2  A  B  C  D  E  F
        0     A, B     1  1  1  0  0  0  0
        1  B, C, D     2  0  1  1  1  0  0
        2     E, F     3  0  0  0  0  1  1
        3  A, E, F     4  1  0  0  0  1  1

        Method chaining syntax:

        >>> import pandas as pd
        >>> import janitor
        >>> df = (
        ...     pd.DataFrame(
        ...         {
        ...             "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
        ...             "col2": [1, 2, 3, 4],
        ...         }
        ...     )
        ...     .expand_column(
        ...         column_name='col1',
        ...         sep=', '
        ...     )
        ... )
        >>> df
              col1  col2  A  B  C  D  E  F
        0     A, B     1  1  1  0  0  0  0
        1  B, C, D     2  0  1  1  1  0  0
        2     E, F     3  0  0  0  0  1  1
        3  A, E, F     4  1  0  0  0  1  1

    Args:
        df: A pandas DataFrame.
        column_name: Which column to expand.
        sep: The delimiter, same to
            `pandas.Series.str.get_dummies`'s `sep`.
        concat: Whether to return the expanded column concatenated to
            the original dataframe (`concat=True`), or to return it standalone
            (`concat=False`).

    Returns:
        A pandas DataFrame with an expanded column.
    """  # noqa: E501
    expanded_df = df[column_name].str.get_dummies(sep=sep)
    if concat:
        return df.join(expanded_df)
    return expanded_df

expand_grid

Implementation source for expand_grid.

expand_grid(df=None, df_key=None, *, others=None)

Creates a DataFrame from a cartesian combination of all inputs.

It is not restricted to a DataFrame; it can work with any list-like structure that is one- or two-dimensional.

If method-chaining to a DataFrame, a string argument must be provided for the df_key parameter.

Data types are preserved in this function, including pandas' extension array dtypes.

The output will always be a DataFrame, usually with a MultiIndex column, with the keys of the others dictionary serving as the top level columns.

If a pandas Series/DataFrame is passed, and has a labeled index, or a MultiIndex index, the index is discarded; the final DataFrame will have a RangeIndex.

The MultiIndexed DataFrame can be flattened using pyjanitor's collapse_levels method; the user can also decide to drop any of the levels, via pandas' droplevel method.

Examples:

>>> import pandas as pd
>>> from janitor.functions.expand_grid import expand_grid
>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
>>> data = {"z": [1, 2, 3]}
>>> df.expand_grid(df_key="df", others=data)
  df     z
   x  y  0
0  1  2  1
1  1  2  2
2  1  2  3
3  2  1  1
4  2  1  2
5  2  1  3

`expand_grid` works with non-pandas objects:

>>> data = {"x": [1, 2, 3], "y": [1, 2]}
>>> expand_grid(others=data)
   x  y
   0  0
0  1  1
1  1  2
2  2  1
3  2  2
4  3  1
5  3  2

Parameters:

Name Type Description Default
df Optional[DataFrame]

A pandas DataFrame.

None
df_key Optional[str]

Name of key for the dataframe. It becomes part of the column names of the dataframe.

None
others Optional[Dict]

A dictionary that contains the data to be combined with the dataframe. If no dataframe exists, all inputs in others will be combined to create a DataFrame.

None

Raises:

Type Description
KeyError

If there is a DataFrame and df_key is not provided.

Returns:

Type Description
Union[DataFrame, None]

A pandas DataFrame of the cartesian product. If df is not provided and others is not provided, None is returned.
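
A sketch of the flattening step mentioned above: pandas' droplevel discards one level of the MultiIndex columns, while collapse_levels would instead join the levels into single labels:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
>>> out = df.expand_grid(df_key="df", others={"z": [1, 2, 3]})
>>> out.columns.nlevels
2
>>> out.droplevel(level=-1, axis="columns").columns.to_list()
['df', 'df', 'z']
```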

Source code in janitor/functions/expand_grid.py
@pf.register_dataframe_method
def expand_grid(
    df: Optional[pd.DataFrame] = None,
    df_key: Optional[str] = None,
    *,
    others: Optional[Dict] = None,
) -> Union[pd.DataFrame, None]:
    """Creates a DataFrame from a cartesian combination of all inputs.

    It is not restricted to DataFrame;
    it can work with any list-like structure
    that is 1 or 2 dimensional.

    If method-chaining to a DataFrame, a string argument
    to `df_key` parameter must be provided.

    Data types are preserved in this function,
    including pandas' extension array dtypes.

    The output will always be a DataFrame, usually with a MultiIndex column,
    with the keys of the `others` dictionary serving as the top level columns.

    If a pandas Series/DataFrame is passed, and has a labeled index, or
    a MultiIndex index, the index is discarded; the final DataFrame
    will have a RangeIndex.

    The MultiIndexed DataFrame can be flattened using pyjanitor's
    [`collapse_levels`][janitor.functions.collapse_levels.collapse_levels]
    method; the user can also decide to drop any of the levels, via pandas'
    `droplevel` method.

    Examples:

        >>> import pandas as pd
        >>> from janitor.functions.expand_grid import expand_grid
        >>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
        >>> data = {"z": [1, 2, 3]}
        >>> df.expand_grid(df_key="df", others=data)
          df     z
           x  y  0
        0  1  2  1
        1  1  2  2
        2  1  2  3
        3  2  1  1
        4  2  1  2
        5  2  1  3

        `expand_grid` works with non-pandas objects:

        >>> data = {"x": [1, 2, 3], "y": [1, 2]}
        >>> expand_grid(others=data)
           x  y
           0  0
        0  1  1
        1  1  2
        2  2  1
        3  2  2
        4  3  1
        5  3  2

    Args:
        df: A pandas DataFrame.
        df_key: Name of key for the dataframe.
            It becomes part of the column names of the dataframe.
        others: A dictionary that contains the data
            to be combined with the dataframe.
            If no dataframe exists, all inputs
            in `others` will be combined to create a DataFrame.

    Raises:
        KeyError: If there is a DataFrame and `df_key` is not provided.

    Returns:
        A pandas DataFrame of the cartesian product.
        If `df` is not provided, and `others` is not provided,
        None is returned.
    """

    if df is not None:
        check("df", df, [pd.DataFrame])
        if not df_key:
            raise KeyError(
                "Using `expand_grid` as part of a "
                "DataFrame method chain requires that "
                "a string argument be provided for "
                "the `df_key` parameter. "
            )

        check("df_key", df_key, [str])

    if not others and (df is not None):
        return df

    if not others:
        return None

    check("others", others, [dict])

    for key in others:
        check("key", key, [str])

    if df is not None:
        others = {**{df_key: df}, **others}

    others = _computations_expand_grid(others)
    return pd.DataFrame(others, copy=False)

explode_index

Implementation of the explode_index function.

explode_index(df, names_sep=None, names_pattern=None, axis='columns', level_names=None)

Explode a single index DataFrame into a MultiIndex DataFrame.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...          {'max_speed_mean': [267.3333333333333, 50.5],
...           'max_speed_median': [389.0, 50.5]})
>>> df
   max_speed_mean  max_speed_median
0      267.333333             389.0
1       50.500000              50.5
>>> df.explode_index(names_sep='_',axis='columns')
          max
        speed
         mean median
0  267.333333  389.0
1   50.500000   50.5
>>> df.explode_index(names_pattern=r"(.+speed)_(.+)",axis='columns')
    max_speed
         mean median
0  267.333333  389.0
1   50.500000   50.5
>>> df.explode_index(
...     names_pattern=r"(?P<measurement>.+speed)_(?P<aggregation>.+)",
...     axis='columns'
... )
measurement   max_speed
aggregation        mean median
0            267.333333  389.0
1             50.500000   50.5
>>> df.explode_index(
...     names_sep='_',
...     axis='columns',
...     level_names = ['min or max', 'measurement','aggregation']
... )
min or max          max
measurement       speed
aggregation        mean median
0            267.333333  389.0
1             50.500000   50.5

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
names_sep Union[str, None]

string or compiled regex used to split the column/index into levels.

None
names_pattern Union[str, None]

regex to extract new levels from the column/index.

None
axis str

Either 'index' or 'columns'. Determines which axis to explode.

'columns'
level_names list

names of the levels in the MultiIndex.

None

Returns:

Type Description
DataFrame

A pandas DataFrame with a MultiIndex.
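
The examples above all explode the columns; a minimal sketch (not from the original docs) of the axis='index' case, splitting the row labels instead:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {"speed": [267.3, 50.5]}, index=["falcon_mean", "parrot_mean"]
... )
>>> df.explode_index(names_sep="_", axis="index").index.to_list()
[('falcon', 'mean'), ('parrot', 'mean')]
```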

Source code in janitor/functions/explode_index.py
@pf.register_dataframe_method
def explode_index(
    df: pd.DataFrame,
    names_sep: Union[str, None] = None,
    names_pattern: Union[str, None] = None,
    axis: str = "columns",
    level_names: list = None,
) -> pd.DataFrame:
    """Explode a single index DataFrame into a MultiIndex DataFrame.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...          {'max_speed_mean': [267.3333333333333, 50.5],
        ...           'max_speed_median': [389.0, 50.5]})
        >>> df
           max_speed_mean  max_speed_median
        0      267.333333             389.0
        1       50.500000              50.5
        >>> df.explode_index(names_sep='_',axis='columns')  # doctest: +NORMALIZE_WHITESPACE
                  max
                speed
                 mean median
        0  267.333333  389.0
        1   50.500000   50.5
        >>> df.explode_index(names_pattern=r"(.+speed)_(.+)",axis='columns') # doctest: +NORMALIZE_WHITESPACE
            max_speed
                 mean median
        0  267.333333  389.0
        1   50.500000   50.5
        >>> df.explode_index(
        ...     names_pattern=r"(?P<measurement>.+speed)_(?P<aggregation>.+)",
        ...     axis='columns'
        ... ) # doctest: +NORMALIZE_WHITESPACE
        measurement   max_speed
        aggregation        mean median
        0            267.333333  389.0
        1             50.500000   50.5
        >>> df.explode_index(
        ...     names_sep='_',
        ...     axis='columns',
        ...     level_names = ['min or max', 'measurement','aggregation']
        ... ) # doctest: +NORMALIZE_WHITESPACE
        min or max          max
        measurement       speed
        aggregation        mean median
        0            267.333333  389.0
        1             50.500000   50.5

    Args:
        df: A pandas DataFrame.
        names_sep: string or compiled regex used to split the column/index into levels.
        names_pattern: regex to extract new levels from the column/index.
        axis: 'index/columns'. Determines which axis to explode.
        level_names: names of the levels in the MultiIndex.

    Returns:
        A pandas DataFrame with a MultiIndex.
    """  # noqa: E501
    check("axis", axis, [str])
    if axis not in {"index", "columns"}:
        raise ValueError("axis should be either index or columns.")
    if (names_sep is None) and (names_pattern is None):
        raise ValueError(
            "Provide argument for either names_sep or names_pattern."
        )
    if (names_sep is not None) and (names_pattern is not None):
        raise ValueError(
            "Provide argument for either names_sep or names_pattern, not both."
        )
    if names_sep is not None:
        check("names_sep", names_sep, [str])
    if names_pattern is not None:
        check("names_pattern", names_pattern, [str])
    if level_names is not None:
        check("level_names", level_names, [list])

    new_index = getattr(df, axis)
    if isinstance(new_index, pd.MultiIndex):
        return df
    # avoid a copy - Index is immutable; a slice is safe to use.
    df = df[:]
    if names_sep:
        new_index = new_index.str.split(names_sep, expand=True)
    else:
        named_groups = re.compile(names_pattern).groupindex
        if named_groups and not level_names:
            level_names = list(named_groups)
        new_index = new_index.str.extract(names_pattern)
        new_index = [arr.array for _, arr in new_index.items()]
        new_index = pd.MultiIndex.from_arrays(new_index)
    if level_names:
        new_index.names = level_names

    setattr(df, axis, new_index)
    return df

factorize_columns

Implementation of the factorize_columns function

factorize_columns(df, column_names, suffix='_enc', **kwargs)

Converts labels into numerical data.

This method will create a new column with the string _enc appended after the original column's name. This can be overridden with the suffix parameter.

Internally, this method uses pandas' factorize method. It also accepts an optional suffix and keyword arguments. An empty string as the suffix will overwrite the existing column.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["b", "b", "a", "c", "b"],
...     "bar": range(4, 9),
... })
>>> df
  foo  bar
0   b    4
1   b    5
2   a    6
3   c    7
4   b    8
>>> df.factorize_columns(column_names="foo")
  foo  bar  foo_enc
0   b    4        0
1   b    5        0
2   a    6        1
3   c    7        2
4   b    8        0

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names.

required
suffix str

Suffix to be used for the new column. An empty string suffix means the existing column will be overwritten.

'_enc'
**kwargs Any

Keyword arguments. Any of the keyword arguments that the pandas factorize method accepts, such as sort, na_sentinel, or size_hint.

{}

Returns:

Type Description
DataFrame

A pandas DataFrame.
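
A sketch of the suffix and keyword-argument behaviour described above, based on the documented semantics: an empty suffix overwrites the original column, and sort is passed through to pandas' factorize:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["b", "b", "a", "c", "b"],
...     "bar": range(4, 9),
... })
>>> out = df.factorize_columns(column_names="foo", suffix="", sort=True)
>>> out["foo"].to_list()
[1, 1, 0, 2, 1]
>>> out.columns.to_list()
['foo', 'bar']
```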

Source code in janitor/functions/factorize_columns.py
@pf.register_dataframe_method
def factorize_columns(
    df: pd.DataFrame,
    column_names: Union[str, Iterable[str], Hashable],
    suffix: str = "_enc",
    **kwargs: Any,
) -> pd.DataFrame:
    """Converts labels into numerical data.

    This method will create a new column with the string `_enc` appended
    after the original column's name.
    This can be overridden with the suffix parameter.

    Internally, this method uses pandas `factorize` method.
    It takes in an optional suffix and keyword arguments also.
    An empty string as suffix will override the existing column.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "foo": ["b", "b", "a", "c", "b"],
        ...     "bar": range(4, 9),
        ... })
        >>> df
          foo  bar
        0   b    4
        1   b    5
        2   a    6
        3   c    7
        4   b    8
        >>> df.factorize_columns(column_names="foo")
          foo  bar  foo_enc
        0   b    4        0
        1   b    5        0
        2   a    6        1
        3   c    7        2
        4   b    8        0

    Args:
        df: The pandas DataFrame object.
        column_names: A column name or an iterable (list or tuple) of
            column names.
        suffix: Suffix to be used for the new column.
            An empty string suffix means, it will override the existing column.
        **kwargs: Keyword arguments. It takes any of the keyword arguments,
            which the pandas factorize method takes like `sort`, `na_sentinel`,
            `size_hint`.

    Returns:
        A pandas DataFrame.
    """
    df = _factorize(df.copy(), column_names, suffix, **kwargs)
    return df

fill

fill_direction(df, **kwargs)

Provide a method-chainable function for filling missing values in selected columns.

It is a wrapper for pd.Series.ffill and pd.Series.bfill, and pairs the column name with one of up, down, updown, and downup.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

Examples:

>>> import pandas as pd
>>> import janitor as jn
>>> df = pd.DataFrame(
...    {
...        'col1': [1, 2, 3, 4],
...        'col2': [None, 5, 6, 7],
...        'col3': [8, 9, 10, None],
...        'col4': [None, None, 11, None],
...        'col5': [None, 12, 13, None]
...    }
... )
>>> df
   col1  col2  col3  col4  col5
0     1   NaN   8.0   NaN   NaN
1     2   5.0   9.0   NaN  12.0
2     3   6.0  10.0  11.0  13.0
3     4   7.0   NaN   NaN   NaN
>>> df.fill_direction(
... col2 = 'up',
... col3 = 'down',
... col4 = 'downup',
... col5 = 'updown'
... )
   col1  col2  col3  col4  col5
0     1   5.0   8.0  11.0  12.0
1     2   5.0   9.0  11.0  12.0
2     3   6.0  10.0  11.0  13.0
3     4   7.0  10.0  11.0  13.0

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
**kwargs Any

Key-value pairs of columns and directions. Directions can be one of down, up, updown (fill up then down), or downup (fill down then up).

{}

Raises:

Type Description
ValueError

If direction supplied is not one of down, up, updown, or downup.

Returns:

Type Description
DataFrame

A pandas DataFrame with modified column(s).
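
For reference, a sketch of the suggested plain-pandas replacement, chaining ffill/bfill per column via assign (reusing the df from the example above):

```python
>>> df.assign(
...     col2=df["col2"].bfill(),          # "up"
...     col3=df["col3"].ffill(),          # "down"
...     col4=df["col4"].ffill().bfill(),  # "downup"
...     col5=df["col5"].bfill().ffill(),  # "updown"
... )
   col1  col2  col3  col4  col5
0     1   5.0   8.0  11.0  12.0
1     2   5.0   9.0  11.0  12.0
2     3   6.0  10.0  11.0  13.0
3     4   7.0  10.0  11.0  13.0
```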

Source code in janitor/functions/fill.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.assign` instead."
    )
)
def fill_direction(df: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
    """Provide a method-chainable function for filling missing values
    in selected columns.

    It is a wrapper for `pd.Series.ffill` and `pd.Series.bfill`,
    and pairs the column name with one of `up`, `down`, `updown`,
    and `downup`.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.assign` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor as jn
        >>> df = pd.DataFrame(
        ...    {
        ...        'col1': [1, 2, 3, 4],
        ...        'col2': [None, 5, 6, 7],
        ...        'col3': [8, 9, 10, None],
        ...        'col4': [None, None, 11, None],
        ...        'col5': [None, 12, 13, None]
        ...    }
        ... )
        >>> df
           col1  col2  col3  col4  col5
        0     1   NaN   8.0   NaN   NaN
        1     2   5.0   9.0   NaN  12.0
        2     3   6.0  10.0  11.0  13.0
        3     4   7.0   NaN   NaN   NaN
        >>> df.fill_direction(
        ... col2 = 'up',
        ... col3 = 'down',
        ... col4 = 'downup',
        ... col5 = 'updown'
        ... )
           col1  col2  col3  col4  col5
        0     1   5.0   8.0  11.0  12.0
        1     2   5.0   9.0  11.0  12.0
        2     3   6.0  10.0  11.0  13.0
        3     4   7.0  10.0  11.0  13.0

    Args:
        df: A pandas DataFrame.
        **kwargs: Key - value pairs of columns and directions.
            Directions can be either `down`, `up`, `updown`
            (fill up then down) and `downup` (fill down then up).

    Raises:
        ValueError: If direction supplied is not one of `down`, `up`,
            `updown`, or `downup`.

    Returns:
        A pandas DataFrame with modified column(s).
    """  # noqa: E501

    if not kwargs:
        return df

    fill_types = {fill.name for fill in _FILLTYPE}
    for column_name, fill_type in kwargs.items():
        check("column_name", column_name, [str])
        check("fill_type", fill_type, [str])
        if fill_type.upper() not in fill_types:
            raise ValueError(
                "fill_type should be one of up, down, updown, or downup."
            )

    check_column(df, kwargs)

    new_values = {}
    for column_name, fill_type in kwargs.items():
        direction = _FILLTYPE[f"{fill_type.upper()}"].value
        if len(direction) == 1:
            direction = methodcaller(direction[0])
            output = direction(df[column_name])
        else:
            direction = [methodcaller(entry) for entry in direction]
            output = _chain_func(df[column_name], *direction)
        new_values[column_name] = output

    return df.assign(**new_values)

fill_empty(df, column_names, value)

Fill NaN values in specified columns with a given value.

Super sugary syntax that wraps pandas.DataFrame.fillna.

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use jn.impute instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...        {
...            'col1': [1, 2, 3],
...            'col2': [None, 4, None ],
...            'col3': [None, 5, 6]
...        }
...    )
>>> df
   col1  col2  col3
0     1   NaN   NaN
1     2   4.0   5.0
2     3   NaN   6.0
>>> df.fill_empty(column_names = 'col2', value = 0)
   col1  col2  col3
0     1   0.0   NaN
1     2   4.0   5.0
2     3   0.0   6.0
>>> df.fill_empty(column_names = ['col2', 'col3'], value = 0)
   col1  col2  col3
0     1   0.0   0.0
1     2   4.0   5.0
2     3   0.0   6.0

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names. If a single column name is passed in, then only that column will be filled; if a list or tuple is passed in, then those columns will all be filled with the same value.

required
value Any

The value that replaces the NaN values.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with NaN values filled.
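
Since fill_empty wraps pandas.DataFrame.fillna, the plain-pandas equivalent of the last example is a one-liner (a sketch, recreating the same df):

```python
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col1': [1, 2, 3],
...     'col2': [None, 4, None],
...     'col3': [None, 5, 6],
... })
>>> df.fillna({"col2": 0, "col3": 0})
   col1  col2  col3
0     1   0.0   0.0
1     2   4.0   5.0
2     3   0.0   6.0
```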

Source code in janitor/functions/fill.py
@pf.register_dataframe_method
@refactored_function(
    message="This function will be deprecated in a 1.x release. "
    "Kindly use `jn.impute` instead."
)
@deprecated_alias(columns="column_names")
def fill_empty(
    df: pd.DataFrame,
    column_names: Union[str, Iterable[str], Hashable],
    value: Any,
) -> pd.DataFrame:
    """Fill `NaN` values in specified columns with a given value.

    Super sugary syntax that wraps `pandas.DataFrame.fillna`.

    This method mutates the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use [`jn.impute`][janitor.functions.impute.impute] instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...        {
        ...            'col1': [1, 2, 3],
        ...            'col2': [None, 4, None ],
        ...            'col3': [None, 5, 6]
        ...        }
        ...    )
        >>> df
           col1  col2  col3
        0     1   NaN   NaN
        1     2   4.0   5.0
        2     3   NaN   6.0
        >>> df.fill_empty(column_names = 'col2', value = 0)
           col1  col2  col3
        0     1   0.0   NaN
        1     2   4.0   5.0
        2     3   0.0   6.0
        >>> df.fill_empty(column_names = ['col2', 'col3'], value = 0)
           col1  col2  col3
        0     1   0.0   0.0
        1     2   4.0   5.0
        2     3   0.0   6.0

    Args:
        df: A pandas DataFrame.
        column_names: A column name or an iterable (list
            or tuple) of column names. If a single column name is passed in,
            then only that column will be filled; if a list or tuple is passed
            in, then those columns will all be filled with the same value.
        value: The value that replaces the `NaN` values.

    Returns:
        A pandas DataFrame with `NaN` values filled.
    """

    check_column(df, column_names)
    return _fill_empty(df, column_names, value=value)

filter

filter_column_isin(df, column_name, iterable, complement=False)

Filter a dataframe for values in a column that exist in the given iterable.

This method does not mutate the original DataFrame.

Assumes exact matching; fuzzy matching not implemented.

Examples:

Filter the dataframe to retain rows for which names are exactly James or John.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})
>>> df
    names foo
0    Jane   x
1  Jeremy   y
2    John   z
>>> df.filter_column_isin(column_name="names", iterable=["James", "John"])
  names foo
2  John   z

This is the method-chaining alternative to:

df = df[df["names"].isin(["James", "John"])]

If complement=True, then we will only get rows for which the names are neither James nor John.
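
For example (a quick sketch continuing with the same df):

```python
>>> df.filter_column_isin(
...     column_name="names", iterable=["James", "John"], complement=True
... )
    names foo
0    Jane   x
1  Jeremy   y
```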

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column on which to filter.

required
iterable Iterable

An iterable. Could be a list, a tuple, or another pandas Series.

required
complement bool

Whether to return the complement of the selection or not.

False

Raises:

Type Description
ValueError

If iterable does not have a length of 1 or greater.

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def filter_column_isin(
    df: pd.DataFrame,
    column_name: Hashable,
    iterable: Iterable,
    complement: bool = False,
) -> pd.DataFrame:
    """Filter a dataframe for values in a column that exist in the given iterable.

    This method does not mutate the original DataFrame.

    Assumes exact matching; fuzzy matching not implemented.

    Examples:
        Filter the dataframe to retain rows for which `names`
        are exactly `James` or `John`.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})
        >>> df
            names foo
        0    Jane   x
        1  Jeremy   y
        2    John   z
        >>> df.filter_column_isin(column_name="names", iterable=["James", "John"])
          names foo
        2  John   z

        This is the method-chaining alternative to:

        ```python
        df = df[df["names"].isin(["James", "John"])]
        ```

        If `complement=True`, then we will only get rows for which the names
        are neither `James` nor `John`.

    Args:
        df: A pandas DataFrame.
        column_name: The column on which to filter.
        iterable: An iterable. Could be a list, tuple, another pandas
            Series.
        complement: Whether to return the complement of the selection or
            not.

    Raises:
        ValueError: If `iterable` does not have a length of `1`
            or greater.

    Returns:
        A filtered pandas DataFrame.
    """  # noqa: E501
    if len(iterable) == 0:
        raise ValueError(
            "`iterable` kwarg must be given an iterable of length 1 "
            "or greater."
        )
    criteria = df[column_name].isin(iterable)

    if complement:
        return df[~criteria]
    return df[criteria]

filter_date(df, column_name, start_date=None, end_date=None, years=None, months=None, days=None, column_date_options=None, format=None)

Filter a date-based column based on certain criteria.

This method does not mutate the original DataFrame.

Dates may be finicky and this function builds on top of the magic from the pandas to_datetime function that is able to parse dates well.

Additional options to parse the date type of your column may be found at the official pandas documentation.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": range(5, 9),
...     "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
... })
>>> df
   a          dt
0  5  2021-11-12
1  6  2021-12-15
2  7  2022-01-03
3  8  2022-01-09
>>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")
   a         dt
1  6 2021-12-15
2  7 2022-01-03
>>> df.filter_date("dt", years=[2021], months=[12])
   a         dt
1  6 2021-12-15

Note

This method will cast your column to a Timestamp!

Note

The format parameter only affects the parsing of the start_date and end_date parameters. If there's an issue with the format of the DataFrame column being parsed, you would pass {'format': your_format} to column_date_options instead.

Parameters:

Name Type Description Default
df DataFrame

The dataframe to filter on.

required
column_name Hashable

The column on which to apply the date filter.

required
start_date Optional[date]

The beginning date to use to filter the DataFrame.

None
end_date Optional[date]

The end date to use to filter the DataFrame.

None
years Optional[List]

The years to use to filter the DataFrame.

None
months Optional[List]

The months to use to filter the DataFrame.

None
days Optional[List]

The days to use to filter the DataFrame.

None
column_date_options Optional[Dict]

Special options to use when parsing the date column in the original DataFrame. The options may be found at the official Pandas documentation.

None
format Optional[str]

If you're using a format for start_date or end_date that is not recognized natively by pandas' to_datetime function, you may supply the format yourself. Python date and time format codes may be found in the strftime documentation.

None

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.
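
A sketch of the two notes above: format applies to the start_date/end_date strings, while column_date_options controls how the column itself is parsed. The format strings used here are illustrative only:

```python
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": range(5, 9),
...     "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
... })
>>> df.filter_date(
...     "dt",
...     start_date="01/12/2021",
...     end_date="05/01/2022",
...     format="%d/%m/%Y",
...     column_date_options={"format": "%Y-%m-%d"},
... )
   a         dt
1  6 2021-12-15
2  7 2022-01-03
```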

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name", start="start_date", end="end_date")
def filter_date(
    df: pd.DataFrame,
    column_name: Hashable,
    start_date: Optional[dt.date] = None,
    end_date: Optional[dt.date] = None,
    years: Optional[List] = None,
    months: Optional[List] = None,
    days: Optional[List] = None,
    column_date_options: Optional[Dict] = None,
    format: Optional[str] = None,  # skipcq: PYL-W0622
) -> pd.DataFrame:
    """Filter a date-based column based on certain criteria.

    This method does not mutate the original DataFrame.

    Dates may be finicky and this function builds on top of the *magic* from
    the pandas `to_datetime` function that is able to parse dates well.

    Additional options to parse the date type of your column may be found at
    the official pandas [documentation][datetime].

    [datetime]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": range(5, 9),
        ...     "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
        ... })
        >>> df
           a          dt
        0  5  2021-11-12
        1  6  2021-12-15
        2  7  2022-01-03
        3  8  2022-01-09
        >>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")
           a         dt
        1  6 2021-12-15
        2  7 2022-01-03
        >>> df.filter_date("dt", years=[2021], months=[12])
           a         dt
        1  6 2021-12-15

    !!!note

        This method will cast your column to a Timestamp!

    !!!note

        This only affects the format of the `start_date` and `end_date`
        parameters. If there's an issue with the format of the DataFrame being
        parsed, you would pass `{'format': your_format}` to `column_date_options`.

    Args:
        df: The dataframe to filter on.
        column_name: The column on which to apply the date filter.
        start_date: The beginning date to use to filter the DataFrame.
        end_date: The end date to use to filter the DataFrame.
        years: The years to use to filter the DataFrame.
        months: The months to use to filter the DataFrame.
        days: The days to use to filter the DataFrame.
        column_date_options: Special options to use when parsing the date
            column in the original DataFrame. The options may be found at the
            official Pandas documentation.
        format: If you're using a format for `start_date` or `end_date`
            that is not recognized natively by pandas' `to_datetime` function, you
            may supply the format yourself. Python date and time formats may be
            found [here](http://strftime.org/).

    Returns:
        A filtered pandas DataFrame.
    """  # noqa: E501

    def _date_filter_conditions(conditions):
        """Taken from: https://stackoverflow.com/a/13616382."""
        return reduce(np.logical_and, conditions)

    if column_date_options is None:
        column_date_options = {}
    df[column_name] = pd.to_datetime(df[column_name], **column_date_options)

    _filter_list = []

    if start_date:
        start_date = pd.to_datetime(start_date, format=format)
        _filter_list.append(df[column_name] >= start_date)

    if end_date:
        end_date = pd.to_datetime(end_date, format=format)
        _filter_list.append(df[column_name] <= end_date)

    if years:
        _filter_list.append(df[column_name].dt.year.isin(years))

    if months:
        _filter_list.append(df[column_name].dt.month.isin(months))

    if days:
        _filter_list.append(df[column_name].dt.day.isin(days))

    if start_date and end_date and start_date > end_date:
        warnings.warn(
            f"Your start date of {start_date} is after your end date of "
            f"{end_date}. Is this intended?"
        )

    return df.loc[_date_filter_conditions(_filter_list), :]

filter_on(df, criteria, complement=False)

Return a dataframe filtered on a particular criteria.

This method does not mutate the original DataFrame.

This is super-sugary syntax that wraps the pandas .query() API, enabling users to use strings to quickly specify filters on their dataframe. The intent is that filter_on as a verb better matches the intent of a pandas user than the verb query.

This is intended to be the method-chaining equivalent of the following:

df = df[df["score"] < 3]

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.query instead.

Examples:

Filter students who failed an exam (scored less than 50).

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "student_id": ["S1", "S2", "S3"],
...     "score": [40, 60, 85],
... })
>>> df
  student_id  score
0         S1     40
1         S2     60
2         S3     85
>>> df.filter_on("score < 50", complement=False)
  student_id  score
0         S1     40

Credit to Brant Peterson for the name.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
criteria str

A filtering criterion that returns an array or Series of booleans, on which pandas can filter.

required
complement bool

Whether to return the complement of the filter or not. If set to True, then the rows for which the criteria is False are retained instead.

False

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.
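
A sketch of complement=True alongside the suggested pd.DataFrame.query replacement (same df as in the example above):

```python
>>> df.filter_on("score < 50", complement=True)
  student_id  score
1         S2     60
2         S3     85
>>> df.query("not (score < 50)")
  student_id  score
1         S2     60
2         S3     85
```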

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.query` instead."
    )
)
def filter_on(
    df: pd.DataFrame,
    criteria: str,
    complement: bool = False,
) -> pd.DataFrame:
    """Return a dataframe filtered on a particular criteria.

    This method does not mutate the original DataFrame.

    This is super-sugary syntax that wraps the pandas `.query()` API, enabling
    users to use strings to quickly specify filters for filtering their
    dataframe. The intent is that `filter_on` as a verb better matches the
    intent of a pandas user than the verb `query`.

    This is intended to be the method-chaining equivalent of the following:

    ```python
    df = df[df["score"] < 3]
    ```

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.query` instead.


    Examples:
        Filter students who failed an exam (scored less than 50).

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "student_id": ["S1", "S2", "S3"],
        ...     "score": [40, 60, 85],
        ... })
        >>> df
          student_id  score
        0         S1     40
        1         S2     60
        2         S3     85
        >>> df.filter_on("score < 50", complement=False)
          student_id  score
        0         S1     40

    Credit to Brant Peterson for the name.

    Args:
        df: A pandas DataFrame.
        criteria: A filtering criteria that returns an array or Series of
            booleans, on which pandas can filter on.
        complement: Whether to return the complement of the filter or not.
            If set to True, then the rows for which the criteria is False are
            retained instead.

    Returns:
        A filtered pandas DataFrame.
    """

    warnings.warn(
        "This function will be deprecated in a 1.x release. "
        "Kindly use `pd.DataFrame.query` instead.",
        DeprecationWarning,
        stacklevel=find_stack_level(),
    )

    if complement:
        return df.query(f"not ({criteria})")
    return df.query(criteria)

filter_string(df, column_name, search_string, complement=False, case=True, flags=0, na=None, regex=True)

Filter a string-based column according to whether it contains a substring.

This is super sugary syntax that builds on top of pandas.Series.str.contains. It is meant to be the method-chaining equivalent of the following:

df = df[df[column_name].str.contains(search_string)]

This method does not mutate the original DataFrame.

Examples:

Retain rows whose column values contain a particular substring.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
>>> df
   a     b
0  3  bear
1  4  peeL
2  5  sail
>>> df.filter_string(column_name="b", search_string="ee")
   a     b
1  4  peeL
>>> df.filter_string(column_name="b", search_string="L", case=False)
   a     b
1  4  peeL
2  5  sail

Filter names that do not contain '.' (with regex mode disabled).

>>> import pandas as pd
>>> import janitor
>>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()
>>> df
          Name
0     JoseChen
1  Brian.Salvi
>>> df.filter_string(column_name="Name", search_string=".", regex=False, complement=True)
       Name
0  JoseChen

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to filter. The column should contain strings.

required
search_string str

A regex pattern or a (sub-)string to search.

required
complement bool

Whether to return the complement of the filter or not. If set to True, then the rows for which the string search fails are retained instead.

False
case bool

If True, case sensitive.

True
flags int

Flags to pass through to the re module, e.g. re.IGNORECASE.

0
na Any

Fill value for missing values. The default depends on dtype of the array. For object-dtype, numpy.nan is used. For StringDtype, pandas.NA is used.

None
regex bool

If True, assumes search_string is a regular expression. If False, treats the search_string as a literal string.

True

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.
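
A sketch of passing a regex flag through to the re module; the pattern and flag here are illustrative only:

```python
>>> import re
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
>>> df.filter_string(column_name="b", search_string="^PE", flags=re.IGNORECASE)
   a     b
1  4  peeL
```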

Source code in janitor/functions/filter.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
def filter_string(
    df: pd.DataFrame,
    column_name: Hashable,
    search_string: str,
    complement: bool = False,
    case: bool = True,
    flags: int = 0,
    na: Any = None,
    regex: bool = True,
) -> pd.DataFrame:
    """Filter a string-based column according to whether it contains a substring.

    This is super sugary syntax that builds on top of `pandas.Series.str.contains`.
    It is meant to be the method-chaining equivalent of the following:

    ```python
    df = df[df[column_name].str.contains(search_string)]
    ```

    This method does not mutate the original DataFrame.

    Examples:
        Retain rows whose column values contain a particular substring.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
        >>> df
           a     b
        0  3  bear
        1  4  peeL
        2  5  sail
        >>> df.filter_string(column_name="b", search_string="ee")
           a     b
        1  4  peeL
        >>> df.filter_string(column_name="b", search_string="L", case=False)
           a     b
        1  4  peeL
        2  5  sail

        Filter names does not contain `'.'` (disable regex mode).

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()
        >>> df
                  Name
        0     JoseChen
        1  Brian.Salvi
        >>> df.filter_string(column_name="Name", search_string=".", regex=False, complement=True)
               Name
        0  JoseChen

    Args:
        df: A pandas DataFrame.
        column_name: The column to filter. The column should contain strings.
        search_string: A regex pattern or a (sub-)string to search.
        complement: Whether to return the complement of the filter or not. If
            set to True, then the rows for which the string search fails are retained
            instead.
        case: If True, case sensitive.
        flags: Flags to pass through to the re module, e.g. re.IGNORECASE.
        na: Fill value for missing values. The default depends on dtype of
            the array. For object-dtype, `numpy.nan` is used. For `StringDtype`,
            `pandas.NA` is used.
        regex: If True, assumes `search_string` is a regular expression. If False,
            treats the `search_string` as a literal string.

    Returns:
        A filtered pandas DataFrame.
    """  # noqa: E501

    criteria = df[column_name].str.contains(
        pat=search_string,
        case=case,
        flags=flags,
        na=na,
        regex=regex,
    )

    if complement:
        return df[~criteria]

    return df[criteria]

find_replace

Implementation for find_replace.

find_replace(df, match='exact', **mappings)

Perform a find-and-replace action on provided columns.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.replace instead.

Depending on use case, users can choose either exact, full-value matching, or regular-expression-based fuzzy matching (hence allowing substring matching in the latter case). For strings, the matching is always case sensitive.

Examples:

For instance, given a DataFrame containing orders at a coffee shop:

>>> df = pd.DataFrame({
...     "customer": ["Mary", "Tom", "Lila"],
...     "order": ["ice coffee", "lemonade", "regular coffee"]
... })
>>> df
  customer           order
0     Mary      ice coffee
1      Tom        lemonade
2     Lila  regular coffee

Our task is to replace the values ice coffee and regular coffee in the order column with latte.

Example 1 - exact matching (functional usage):

>>> df = find_replace(
...     df,
...     match="exact",
...     order={"ice coffee": "latte", "regular coffee": "latte"},
... )
>>> df
  customer     order
0     Mary     latte
1      Tom  lemonade
2     Lila     latte

Example 1 - exact matching (method chaining):

>>> df = df.find_replace(
...     match="exact",
...     order={"ice coffee": "latte", "regular coffee": "latte"},
... )
>>> df
  customer     order
0     Mary     latte
1      Tom  lemonade
2     Lila     latte

Example 2 - Regular-expression-based matching (functional usage):

>>> df = find_replace(
...     df,
...     match='regex',
...     order={'coffee$': 'latte'},
... )
>>> df
  customer     order
0     Mary     latte
1      Tom  lemonade
2     Lila     latte

Example 2 - Regular-expression-based matching (method chaining usage):

>>> df = df.find_replace(
...     match='regex',
...     order={'coffee$': 'latte'},
... )
>>> df
  customer     order
0     Mary     latte
1      Tom  lemonade
2     Lila     latte

To perform a find and replace on the entire DataFrame, pandas' df.replace() function provides the appropriate functionality. You can find more detail in the replace docs.

This function only works with column names that have no spaces or punctuation in them. For example, a column name item_name would work with find_replace, because it is a contiguous string that can be parsed correctly, but item name would not be parsed correctly by the Python interpreter.

If you have column names that might not be compatible, we recommend calling on clean_names() as the first method call. If, for whatever reason, that is not possible, then _find_replace is available as a function that you can do a pandas pipe call on.
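
For the exact-match case, a sketch of the suggested plain-pandas replacement mentioned above, using a nested mapping with pd.DataFrame.replace:

```python
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "customer": ["Mary", "Tom", "Lila"],
...     "order": ["ice coffee", "lemonade", "regular coffee"]
... })
>>> df.replace({"order": {"ice coffee": "latte", "regular coffee": "latte"}})
  customer     order
0     Mary     latte
1      Tom  lemonade
2     Lila     latte
```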

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
match str

Whether to perform an exact match or a regex match. Valid values are "exact" or "regex".

'exact'
**mappings Any

keyword arguments corresponding to column names that have dictionaries passed in indicating what to find (keys) and what to replace with (values).

{}

Returns:

Type Description
DataFrame

A pandas DataFrame with replaced values.

Source code in janitor/functions/find_replace.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.replace` instead."
    )
)
def find_replace(
    df: pd.DataFrame, match: str = "exact", **mappings: Any
) -> pd.DataFrame:
    """Perform a find-and-replace action on provided columns.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.replace` instead.

    Depending on use case, users can choose either exact, full-value matching,
    or regular-expression-based fuzzy matching
    (hence allowing substring matching in the latter case).
    For strings, the matching is always case sensitive.

    Examples:
        For instance, given a DataFrame containing orders at a coffee shop:

        >>> df = pd.DataFrame({
        ...     "customer": ["Mary", "Tom", "Lila"],
        ...     "order": ["ice coffee", "lemonade", "regular coffee"]
        ... })
        >>> df
          customer           order
        0     Mary      ice coffee
        1      Tom        lemonade
        2     Lila  regular coffee

        Our task is to replace values `ice coffee` and `regular coffee`
        of the `order` column into `latte`.

        Example 1 - exact matching (functional usage):

        >>> df = find_replace(
        ...     df,
        ...     match="exact",
        ...     order={"ice coffee": "latte", "regular coffee": "latte"},
        ... )
        >>> df
          customer     order
        0     Mary     latte
        1      Tom  lemonade
        2     Lila     latte

        Example 1 - exact matching (method chaining):

        >>> df = df.find_replace(
        ...     match="exact",
        ...     order={"ice coffee": "latte", "regular coffee": "latte"},
        ... )
        >>> df
          customer     order
        0     Mary     latte
        1      Tom  lemonade
        2     Lila     latte

        Example 2 - Regular-expression-based matching (functional usage):

        >>> df = find_replace(
        ...     df,
        ...     match='regex',
        ...     order={'coffee$': 'latte'},
        ... )
        >>> df
          customer     order
        0     Mary     latte
        1      Tom  lemonade
        2     Lila     latte

        Example 2 - Regular-expression-based matching (method chaining usage):

        >>> df = df.find_replace(
        ...     match='regex',
        ...     order={'coffee$': 'latte'},
        ... )
        >>> df
          customer     order
        0     Mary     latte
        1      Tom  lemonade
        2     Lila     latte

    To perform a find and replace on the entire DataFrame,
    pandas' `df.replace()` function provides the appropriate functionality.
    You can find more detail on the [replace] docs.

    [replace]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

    This function only works with column names that have no spaces
    or punctuation in them.
    For example, a column name `item_name` would work with `find_replace`,
    because it is a contiguous string that can be parsed correctly,
    but `item name` would not be parsed correctly by the Python interpreter.

    If you have column names that might not be compatible,
    we recommend calling [`clean_names()`][janitor.functions.clean_names.clean_names]
    as the first method call. If, for whatever reason, that is not possible,
    then `_find_replace` is available as a function
    that you can use in a pandas [pipe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html) call.

    Args:
        df: A pandas DataFrame.
        match: Whether to perform an exact match or a regex-based match.
            Valid values are "exact" or "regex".
        **mappings: keyword arguments corresponding to column names
            that have dictionaries passed in indicating what to find (keys)
            and what to replace with (values).

    Returns:
        A pandas DataFrame with replaced values.
    """  # noqa: E501
    for column_name, mapper in mappings.items():
        df = _find_replace(df, column_name, mapper, match=match)
    return df
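
As the docstring above notes, a column name containing spaces cannot be passed as a keyword argument to `find_replace`. Below is a minimal sketch of the suggested workaround: calling the underlying `_find_replace` helper through `pipe`. It assumes `_find_replace` keeps the `(df, column_name, mapper, match=...)` call signature used in the loop above and lives in `janitor.functions.find_replace`; the column name and mapping are purely illustrative.

```python
import pandas as pd
from janitor.functions.find_replace import _find_replace  # assumed module path

df = pd.DataFrame({
    "customer": ["Mary", "Tom", "Lila"],
    "drink order": ["ice coffee", "lemonade", "regular coffee"],  # name with a space
})

# Mirror the call made inside `find_replace`, but via `pipe`, so the column
# name does not have to be a valid Python keyword argument.
df = df.pipe(
    _find_replace,
    "drink order",
    {"ice coffee": "latte", "regular coffee": "latte"},
    match="exact",
)
```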

flag_nulls

Implementation source for flag_nulls.

flag_nulls(df, column_name='null_flag', columns=None)

Creates a new column to indicate whether you have null values in a given row.

If the columns parameter is not set, looks across the entire DataFrame, otherwise will look only in the columns you set.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": ["w", "x", None, "z"], "b": [5, None, 7, 8],
... })
>>> df.flag_nulls()
      a    b  null_flag
0     w  5.0          0
1     x  NaN          1
2  None  7.0          1
3     z  8.0          0
>>> df.flag_nulls(columns="b")
      a    b  null_flag
0     w  5.0          0
1     x  NaN          1
2  None  7.0          0
3     z  8.0          0

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | Input pandas DataFrame. | required |
| `column_name` | `Optional[Hashable]` | Name for the output column. | `'null_flag'` |
| `columns` | `Optional[Union[str, Iterable[str], Hashable]]` | List of columns to look at for finding null values. If you only want to look at one column, you can simply give its name. If set to `None` (default), all DataFrame columns are used. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `column_name` is already present in the DataFrame. |
| `ValueError` | If any column within `columns` is not present in the DataFrame. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Input dataframe with the null flag column. |

Source code in janitor/functions/flag_nulls.py
@pf.register_dataframe_method
def flag_nulls(
    df: pd.DataFrame,
    column_name: Optional[Hashable] = "null_flag",
    columns: Optional[Union[str, Iterable[str], Hashable]] = None,
) -> pd.DataFrame:
    """Creates a new column to indicate whether you have null values in a given
    row.

    If the columns parameter is not set, looks across the entire
    DataFrame, otherwise will look only in the columns you set.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": ["w", "x", None, "z"], "b": [5, None, 7, 8],
        ... })
        >>> df.flag_nulls()
              a    b  null_flag
        0     w  5.0          0
        1     x  NaN          1
        2  None  7.0          1
        3     z  8.0          0
        >>> df.flag_nulls(columns="b")
              a    b  null_flag
        0     w  5.0          0
        1     x  NaN          1
        2  None  7.0          0
        3     z  8.0          0

    Args:
        df: Input pandas DataFrame.
        column_name: Name for the output column.
        columns: List of columns to look at for finding null values. If you
            only want to look at one column, you can simply give its name.
            If set to None (default), all DataFrame columns are used.

    Raises:
        ValueError: If `column_name` is already present in the
            DataFrame.
        ValueError: If any column within `columns` is not present in
            the DataFrame.

    Returns:
        Input dataframe with the null flag column.

    <!--
    # noqa: DAR402
    -->
    """
    # Sort out columns input
    if isinstance(columns, str):
        columns = [columns]
    elif columns is None:
        columns = df.columns
    elif not isinstance(columns, Iterable):
        # catches other hashable types
        columns = [columns]

    # Input sanitation checks
    check_column(df, columns)
    check_column(df, [column_name], present=False)

    # This algorithm works best for n_rows >> n_cols. See issue #501
    null_array = np.zeros(len(df))
    for col in columns:
        null_array = np.logical_or(null_array, pd.isna(df[col]))

    df = df.copy()
    df[column_name] = null_array.astype(int)
    return df
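
The docstring examples flag nulls across all columns or a single column; `columns` also accepts an iterable, so a subset of columns can be checked in one call. A short illustrative sketch with the same frame:

```python
import pandas as pd
import janitor  # noqa: F401  # registers the DataFrame accessor methods

df = pd.DataFrame({"a": ["w", "x", None, "z"], "b": [5, None, 7, 8]})

# Flag rows that contain a null in either of the listed columns,
# writing the indicator to a custom column name.
flagged = df.flag_nulls(column_name="has_missing", columns=["a", "b"])
```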

get_dupes

Implementation of the get_dupes function

get_dupes(df, column_names=None)

Return all duplicate rows.

This method does not mutate the original DataFrame.

Examples:

Method chaining syntax:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "item": ["shoe", "shoe", "bag", "shoe", "bag"],
...     "quantity": [100, 100, 75, 200, 75],
... })
>>> df
   item  quantity
0  shoe       100
1  shoe       100
2   bag        75
3  shoe       200
4   bag        75
>>> df.get_dupes()
   item  quantity
0  shoe       100
1  shoe       100
2   bag        75
4   bag        75

Optional column_names usage:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "item": ["shoe", "shoe", "bag", "shoe", "bag"],
...     "quantity": [100, 100, 75, 200, 75],
... })
>>> df
   item  quantity
0  shoe       100
1  shoe       100
2   bag        75
3  shoe       200
4   bag        75
>>> df.get_dupes(column_names=["item"])
   item  quantity
0  shoe       100
1  shoe       100
2   bag        75
3  shoe       200
4   bag        75
>>> df.get_dupes(column_names=["quantity"])
   item  quantity
0  shoe       100
1  shoe       100
2   bag        75
4   bag        75

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | The pandas DataFrame object. | required |
| `column_names` | `Optional[Union[str, Iterable[str], Hashable]]` | A column name or an iterable (list or tuple) of column names. Following pandas API, this only considers certain columns for identifying duplicates. Defaults to using all columns. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | The duplicate rows, as a pandas DataFrame. |

Source code in janitor/functions/get_dupes.py
@pf.register_dataframe_method
@deprecated_alias(columns="column_names")
def get_dupes(
    df: pd.DataFrame,
    column_names: Optional[Union[str, Iterable[str], Hashable]] = None,
) -> pd.DataFrame:
    """
    Return all duplicate rows.

    This method does not mutate the original DataFrame.

    Examples:
        Method chaining syntax:

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        ...     "quantity": [100, 100, 75, 200, 75],
        ... })
        >>> df
           item  quantity
        0  shoe       100
        1  shoe       100
        2   bag        75
        3  shoe       200
        4   bag        75
        >>> df.get_dupes()
           item  quantity
        0  shoe       100
        1  shoe       100
        2   bag        75
        4   bag        75

        Optional `column_names` usage:

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        ...     "quantity": [100, 100, 75, 200, 75],
        ... })
        >>> df
           item  quantity
        0  shoe       100
        1  shoe       100
        2   bag        75
        3  shoe       200
        4   bag        75
        >>> df.get_dupes(column_names=["item"])
           item  quantity
        0  shoe       100
        1  shoe       100
        2   bag        75
        3  shoe       200
        4   bag        75
        >>> df.get_dupes(column_names=["quantity"])
           item  quantity
        0  shoe       100
        1  shoe       100
        2   bag        75
        4   bag        75

    Args:
        df: The pandas DataFrame object.
        column_names: A column name or an iterable
            (list or tuple) of column names. Following pandas API, this only
            considers certain columns for identifying duplicates. Defaults
            to using all columns.

    Returns:
        The duplicate rows, as a pandas DataFrame.
    """
    return df.loc[df.duplicated(subset=column_names, keep=False)]
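
Since `column_names` is typed as `str`, an iterable, or any hashable, a single column can also be passed as a plain string; a brief sketch with the same data:

```python
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({
    "item": ["shoe", "shoe", "bag", "shoe", "bag"],
    "quantity": [100, 100, 75, 200, 75],
})

# Equivalent to column_names=["item"]: every row whose "item" value
# appears more than once is returned.
dupes = df.get_dupes(column_names="item")
```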

groupby_agg

Implementation source for groupby_agg.

groupby_agg(df, by, new_column_name, agg_column_name, agg, dropna=True)

Shortcut for assigning a groupby-transform to a new column.

This method does not mutate the original DataFrame.

Intended to be the method-chaining equivalent of:

df = df.assign(...=df.groupby(...)[...].transform(...))

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

Basic usage.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "item": ["shoe", "shoe", "bag", "shoe", "bag"],
...     "quantity": [100, 120, 75, 200, 25],
... })
>>> df.groupby_agg(
...     by="item",
...     agg="mean",
...     agg_column_name="quantity",
...     new_column_name="avg_quantity",
... )
   item  quantity  avg_quantity
0  shoe       100         140.0
1  shoe       120         140.0
2   bag        75          50.0
3  shoe       200         140.0
4   bag        25          50.0

Set dropna=False to compute the aggregation, treating the null values in the by column as an isolated "group".

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "x": ["a", "a", None, "b"], "y": [9, 9, 9, 9],
... })
>>> df.groupby_agg(
...     by="x",
...     agg="count",
...     agg_column_name="y",
...     new_column_name="y_count",
...     dropna=False,
... )
      x  y  y_count
0     a  9        2
1     a  9        2
2  None  9        1
3     b  9        1

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `by` | `Union[List, Callable, str]` | Column(s) to groupby on, will be passed into `DataFrame.groupby`. | required |
| `new_column_name` | `str` | Name of the aggregation output column. | required |
| `agg_column_name` | `str` | Name of the column to aggregate over. | required |
| `agg` | `Union[Callable, str]` | How to aggregate. | required |
| `dropna` | `bool` | Whether or not to include null values, if present in the `by` column(s). Default is `True` (null values in `by` are assigned `NaN` in the new column). | `True` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame. |

Source code in janitor/functions/groupby_agg.py
@pf.register_dataframe_method
@deprecated_alias(new_column="new_column_name", agg_column="agg_column_name")
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `janitor.transform_column` instead."
    )
)
def groupby_agg(
    df: pd.DataFrame,
    by: Union[List, Callable, str],
    new_column_name: str,
    agg_column_name: str,
    agg: Union[Callable, str],
    dropna: bool = True,
) -> pd.DataFrame:
    """Shortcut for assigning a groupby-transform to a new column.

    This method does not mutate the original DataFrame.

    Intended to be the method-chaining equivalent of:

    ```python
    df = df.assign(...=df.groupby(...)[...].transform(...))
    ```

    !!!note

        This function will be deprecated in a 1.x release.
        Please use
        [`jn.transform_column`][janitor.functions.transform_columns.transform_column]
        instead.

    Examples:
        Basic usage.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        ...     "quantity": [100, 120, 75, 200, 25],
        ... })
        >>> df.groupby_agg(
        ...     by="item",
        ...     agg="mean",
        ...     agg_column_name="quantity",
        ...     new_column_name="avg_quantity",
        ... )
           item  quantity  avg_quantity
        0  shoe       100         140.0
        1  shoe       120         140.0
        2   bag        75          50.0
        3  shoe       200         140.0
        4   bag        25          50.0

        Set `dropna=False` to compute the aggregation, treating the null
        values in the `by` column as an isolated "group".

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "x": ["a", "a", None, "b"], "y": [9, 9, 9, 9],
        ... })
        >>> df.groupby_agg(
        ...     by="x",
        ...     agg="count",
        ...     agg_column_name="y",
        ...     new_column_name="y_count",
        ...     dropna=False,
        ... )
              x  y  y_count
        0     a  9        2
        1     a  9        2
        2  None  9        1
        3     b  9        1

    Args:
        df: A pandas DataFrame.
        by: Column(s) to groupby on, will be passed into `DataFrame.groupby`.
        new_column_name: Name of the aggregation output column.
        agg_column_name: Name of the column to aggregate over.
        agg: How to aggregate.
        dropna: Whether or not to include null values, if present in the
            `by` column(s). Default is True (null values in `by` are assigned NaN in
            the new column).

    Returns:
        A pandas DataFrame.
    """  # noqa: E501

    return df.assign(
        **{
            new_column_name: df.groupby(by, dropna=dropna)[
                agg_column_name
            ].transform(agg),
        }
    )
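
Because `groupby_agg` is slated for deprecation, it can help to see the plain pandas idiom it wraps. The sketch below reproduces the first docstring example with `assign` plus a grouped `transform`, as described in the docstring:

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["shoe", "shoe", "bag", "shoe", "bag"],
    "quantity": [100, 120, 75, 200, 25],
})

# Equivalent to:
# df.groupby_agg(by="item", agg="mean",
#                agg_column_name="quantity", new_column_name="avg_quantity")
df = df.assign(
    avg_quantity=df.groupby("item")["quantity"].transform("mean")
)
```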

groupby_topk

Implementation of the groupby_topk function

groupby_topk(df, by, column, k, dropna=True, ascending=True, ignore_index=True)

Return top k rows from a groupby of a set of columns.

Returns a DataFrame that has the top k values per column, grouped by by. Under the hood it uses nlargest/nsmallest for numeric columns, which avoids sorting the entire dataframe and is usually more performant. For non-numeric columns, pd.sort_values is used. No sorting is done to the by column(s); the order is maintained in the final output.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "age": [20, 23, 22, 43, 21],
...         "id": [1, 4, 6, 2, 5],
...         "result": ["pass", "pass", "fail", "pass", "fail"],
...     }
... )
>>> df
   age  id result
0   20   1   pass
1   23   4   pass
2   22   6   fail
3   43   2   pass
4   21   5   fail

Ascending top 3:

>>> df.groupby_topk(by="result", column="age", k=3)
   age  id result
0   20   1   pass
1   23   4   pass
2   43   2   pass
3   21   5   fail
4   22   6   fail

Descending top 2:

>>> df.groupby_topk(
...     by="result", column="age", k=2, ascending=False, ignore_index=False
... )
   age  id result
3   43   2   pass
1   23   4   pass
2   22   6   fail
4   21   5   fail

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `by` | `Union[list, Hashable]` | Column name(s) to group input DataFrame `df` by. | required |
| `column` | `Hashable` | Name of the column that determines `k` rows to return. | required |
| `k` | `int` | Number of top rows to return for each group. | required |
| `dropna` | `bool` | If `True`, and `NA` values exist in `by`, the `NA` values are not used in the groupby computation to get the relevant `k` rows. If `False`, and `NA` values exist in `by`, then the `NA` values are used in the groupby computation to get the relevant `k` rows. | `True` |
| `ascending` | `bool` | If `True`, the smallest top `k` rows, determined by `column`, are returned; if `False`, the largest top `k` rows, determined by `column`, are returned. | `True` |
| `ignore_index` | `bool` | If `True`, the original index is ignored. If `False`, the original index for the top `k` rows is retained. | `True` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `k` is less than 1. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with top `k` rows per `column`, grouped by `by`. |

Source code in janitor/functions/groupby_topk.py
@pf.register_dataframe_method
@deprecated_alias(groupby_column_name="by", sort_column_name="column")
def groupby_topk(
    df: pd.DataFrame,
    by: Union[list, Hashable],
    column: Hashable,
    k: int,
    dropna: bool = True,
    ascending: bool = True,
    ignore_index: bool = True,
) -> pd.DataFrame:
    """Return top `k` rows from a groupby of a set of columns.

    Returns a DataFrame that has the top `k` values per `column`,
    grouped by `by`. Under the hood it uses `nlargest/nsmallest`,
    for numeric columns, which avoids sorting the entire dataframe,
    and is usually more performant. For non-numeric columns, `pd.sort_values`
    is used.
    No sorting is done to the `by` column(s); the order is maintained
    in the final output.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "age": [20, 23, 22, 43, 21],
        ...         "id": [1, 4, 6, 2, 5],
        ...         "result": ["pass", "pass", "fail", "pass", "fail"],
        ...     }
        ... )
        >>> df
           age  id result
        0   20   1   pass
        1   23   4   pass
        2   22   6   fail
        3   43   2   pass
        4   21   5   fail

        Ascending top 3:

        >>> df.groupby_topk(by="result", column="age", k=3)
           age  id result
        0   20   1   pass
        1   23   4   pass
        2   43   2   pass
        3   21   5   fail
        4   22   6   fail

        Descending top 2:

        >>> df.groupby_topk(
        ...     by="result", column="age", k=2, ascending=False, ignore_index=False
        ... )
           age  id result
        3   43   2   pass
        1   23   4   pass
        2   22   6   fail
        4   21   5   fail

    Args:
        df: A pandas DataFrame.
        by: Column name(s) to group input DataFrame `df` by.
        column: Name of the column that determines `k` rows
            to return.
        k: Number of top rows to return for each group.
        dropna: If `True`, and `NA` values exist in `by`, the `NA`
            values are not used in the groupby computation to get the relevant
            `k` rows. If `False`, and `NA` values exist in `by`, then the `NA`
            values are used in the groupby computation to get the relevant
            `k` rows.
        ascending: If `True`, the smallest top `k` rows,
            determined by `column`, are returned; if `False`, the largest top `k`
            rows, determined by `column`, are returned.
        ignore_index: If `True`, the original index is ignored.
            If `False`, the original index for the top `k` rows is retained.

    Raises:
        ValueError: If `k` is less than 1.

    Returns:
        A pandas DataFrame with top `k` rows per `column`, grouped by `by`.
    """  # noqa: E501

    if isinstance(by, Hashable):
        by = [by]

    check("by", by, [Hashable, list])

    check_column(df, [column])
    check_column(df, by)

    if k < 1:
        raise ValueError(
            "Numbers of rows per group "
            "to be returned must be greater than 0."
        )

    indices = df.groupby(by=by, dropna=dropna, sort=False, observed=True)
    indices = indices[column]

    try:
        if ascending:
            indices = indices.nsmallest(n=k)
        else:
            indices = indices.nlargest(n=k)
    except TypeError:
        indices = indices.apply(
            lambda d: d.sort_values(ascending=ascending).head(k)
        )

    indices = indices.index.get_level_values(-1)
    if ignore_index:
        return df.loc[indices].reset_index(drop=True)
    return df.loc[indices]
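
As a further illustration, with `k=1` the call reduces to "the smallest (or largest) value of `column` per group", and because the `by` column(s) are not sorted, groups keep their order of first appearance. A sketch using the docstring's data:

```python
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({
    "age": [20, 23, 22, 43, 21],
    "id": [1, 4, 6, 2, 5],
    "result": ["pass", "pass", "fail", "pass", "fail"],
})

# Smallest age per "result" group; with the default ignore_index=True the
# output is re-indexed 0..n-1 while the "pass" group still comes first.
top1 = df.groupby_topk(by="result", column="age", k=1)
```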

impute

Implementation of impute function

impute(df, column_names, value=None, statistic_column_name=None)

Method-chainable imputation of values in a column.

This method does not mutate the original DataFrame.

Under the hood, this function calls the .fillna() method available to every pandas.Series object.

Either one of value or statistic_column_name should be provided.

If value is provided, then all null values in the selected column will take on the value provided.

If statistic_column_name is provided, then all null values in the selected column(s) will take on the summary statistic value of other non-null values.

Column selection in column_names is possible using the select syntax.

Currently supported statistics include:

  • mean (also aliased by average)
  • median
  • mode
  • minimum (also aliased by min)
  • maximum (also aliased by max)

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [1, 2, 3],
...     "sales": np.nan,
...     "score": [np.nan, 3, 2],
... })
>>> df
   a  sales  score
0  1    NaN    NaN
1  2    NaN    3.0
2  3    NaN    2.0

Imputing null values with 0 (using the value parameter):

>>> df.impute(column_names="sales", value=0.0)
   a  sales  score
0  1    0.0    NaN
1  2    0.0    3.0
2  3    0.0    2.0

Imputing null values with median (using the statistic_column_name parameter):

>>> df.impute(column_names="score", statistic_column_name="median")
   a  sales  score
0  1    NaN    2.5
1  2    NaN    3.0
2  3    NaN    2.0

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_names` | `Any` | The name of the column(s) on which to impute values. | required |
| `value` | `Optional[Any]` | The value used for imputation, passed into `.fillna` method of the underlying pandas Series. | `None` |
| `statistic_column_name` | `Optional[str]` | The column statistic to impute. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If both `value` and `statistic_column_name` are provided. |
| `KeyError` | If `statistic_column_name` is not one of `mean`, `average`, `median`, `mode`, `minimum`, `min`, `maximum`, or `max`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | An imputed pandas DataFrame. |

Source code in janitor/functions/impute.py
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
@deprecated_alias(column_name="column_names")
@deprecated_alias(statistic="statistic_column_name")
def impute(
    df: pd.DataFrame,
    column_names: Any,
    value: Optional[Any] = None,
    statistic_column_name: Optional[str] = None,
) -> pd.DataFrame:
    """Method-chainable imputation of values in a column.

    This method does not mutate the original DataFrame.

    Under the hood, this function calls the `.fillna()` method available
    to every `pandas.Series` object.

    Either one of `value` or `statistic_column_name` should be provided.

    If `value` is provided, then all null values in the selected column will
    take on the value provided.

    If `statistic_column_name` is provided, then all null values in the
    selected column(s) will take on the summary statistic value
    of other non-null values.

    Column selection in `column_names` is possible using the
    [`select`][janitor.functions.select.select] syntax.

    Currently supported statistics include:

    - `mean` (also aliased by `average`)
    - `median`
    - `mode`
    - `minimum` (also aliased by `min`)
    - `maximum` (also aliased by `max`)

    Examples:
        >>> import numpy as np
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": [1, 2, 3],
        ...     "sales": np.nan,
        ...     "score": [np.nan, 3, 2],
        ... })
        >>> df
           a  sales  score
        0  1    NaN    NaN
        1  2    NaN    3.0
        2  3    NaN    2.0

        Imputing null values with 0 (using the `value` parameter):

        >>> df.impute(column_names="sales", value=0.0)
           a  sales  score
        0  1    0.0    NaN
        1  2    0.0    3.0
        2  3    0.0    2.0

        Imputing null values with median (using the `statistic_column_name`
        parameter):

        >>> df.impute(column_names="score", statistic_column_name="median")
           a  sales  score
        0  1    NaN    2.5
        1  2    NaN    3.0
        2  3    NaN    2.0

    Args:
        df: A pandas DataFrame.
        column_names: The name of the column(s) on which to impute values.
        value: The value used for imputation, passed into `.fillna` method
            of the underlying pandas Series.
        statistic_column_name: The column statistic to impute.

    Raises:
        ValueError: If both `value` and `statistic_column_name` are
            provided.
        KeyError: If `statistic_column_name` is not one of `mean`,
            `average`, `median`, `mode`, `minimum`, `min`, `maximum`, or
            `max`.

    Returns:
        An imputed pandas DataFrame.
    """
    # Firstly, we check that only one of `value` or `statistic` are provided.
    if (value is None) and (statistic_column_name is None):
        raise ValueError("Kindly specify a value or a statistic_column_name")

    if value is not None and statistic_column_name is not None:
        raise ValueError(
            "Only one of `value` or `statistic_column_name` should be "
            "provided."
        )

    column_names = get_index_labels([column_names], df, axis="columns")

    if value is not None:
        value = dict(product(column_names, [value]))

    else:
        # If statistic is provided, then we compute
        # the relevant summary statistic
        # from the other data.
        funcs = {
            "mean": "mean",
            "average": "mean",  # aliased
            "median": "median",
            "mode": "mode",
            "minimum": "min",
            "min": "min",  # aliased
            "maximum": "max",
            "max": "max",  # aliased
        }
        # Check that the statistic keyword argument is one of the approved.
        if statistic_column_name not in funcs:
            raise KeyError(
                f"`statistic_column_name` must be one of {funcs.keys()}."
            )

        value = dict(product(column_names, [funcs[statistic_column_name]]))

        value = df.agg(value)

        # special treatment for mode
        if statistic_column_name == "mode":
            value = {key: val.at[0] for key, val in value.items()}

    return df.fillna(value=value)
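
`column_names` can select several columns at once, in which case the same `value` (or the per-column statistic) is applied to each selected column. A short sketch using the docstring's frame:

```python
import numpy as np
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({
    "a": [1, 2, 3],
    "sales": np.nan,
    "score": [np.nan, 3, 2],
})

# Fill nulls in both selected columns with 0.0 in one call;
# column "a" is untouched because it is not selected.
df = df.impute(column_names=["sales", "score"], value=0.0)
```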

jitter

Implementation of the jitter function.

jitter(df, column_name, dest_column_name, scale, clip=None, random_state=None)

Adds Gaussian noise (jitter) to the values of a column.

A new column will be created containing the values of the original column with Gaussian noise added. For each value in the column, a Gaussian distribution is created having a location (mean) equal to the value and a scale (standard deviation) equal to scale. A random value is then sampled from this distribution, which is the jittered value. If a tuple is supplied for clip, then any values of the new column less than clip[0] will be set to clip[0], and any values greater than clip[1] will be set to clip[1]. Additionally, if a numeric value is supplied for random_state, this value will be used to set the random seed used for sampling. NaN values are ignored in this method.

This method mutates the original DataFrame.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 4, 5, np.nan]})
>>> df
     a
0  3.0
1  4.0
2  5.0
3  NaN
>>> df.jitter("a", dest_column_name="a_jit", scale=1, random_state=42)
     a     a_jit
0  3.0  3.496714
1  4.0  3.861736
2  5.0  5.647689
3  NaN       NaN

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable` | Name of the column containing values to add Gaussian jitter to. | required |
| `dest_column_name` | `str` | The name of the new column containing the jittered values that will be created. | required |
| `scale` | `number` | A positive value multiplied by the original column value to determine the scale (standard deviation) of the Gaussian distribution to sample from. (A value of zero results in no jittering.) | required |
| `clip` | `Optional[Iterable[number]]` | An iterable of two values (minimum and maximum) to clip the jittered values to, default to None. | `None` |
| `random_state` | `Optional[number]` | An integer or 1-d array value used to set the random seed, default to None. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If `column_name` is not numeric. |
| `ValueError` | If `scale` is not a numerical value greater than 0. |
| `ValueError` | If `clip` is not an iterable of length 2. |
| `ValueError` | If `clip[0]` is greater than `clip[1]`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with a new column containing Gaussian-jittered values from another column. |

Source code in janitor/functions/jitter.py
@pf.register_dataframe_method
def jitter(
    df: pd.DataFrame,
    column_name: Hashable,
    dest_column_name: str,
    scale: np.number,
    clip: Optional[Iterable[np.number]] = None,
    random_state: Optional[np.number] = None,
) -> pd.DataFrame:
    """Adds Gaussian noise (jitter) to the values of a column.

    A new column will be created containing the values of the original column
    with Gaussian noise added.
    For each value in the column, a Gaussian distribution is created
    having a location (mean) equal to the value
    and a scale (standard deviation) equal to `scale`.
    A random value is then sampled from this distribution,
    which is the jittered value.
    If a tuple is supplied for `clip`,
    then any values of the new column less than `clip[0]`
    will be set to `clip[0]`,
    and any values greater than `clip[1]` will be set to `clip[1]`.
    Additionally, if a numeric value is supplied for `random_state`,
    this value will be used to set the random seed used for sampling.
    NaN values are ignored in this method.

    This method mutates the original DataFrame.

    Examples:
        >>> import numpy as np
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [3, 4, 5, np.nan]})
        >>> df
             a
        0  3.0
        1  4.0
        2  5.0
        3  NaN
        >>> df.jitter("a", dest_column_name="a_jit", scale=1, random_state=42)
             a     a_jit
        0  3.0  3.496714
        1  4.0  3.861736
        2  5.0  5.647689
        3  NaN       NaN

    Args:
        df: A pandas DataFrame.
        column_name: Name of the column containing
            values to add Gaussian jitter to.
        dest_column_name: The name of the new column containing the
            jittered values that will be created.
        scale: A positive value multiplied by the original
            column value to determine the scale (standard deviation) of the
            Gaussian distribution to sample from. (A value of zero results in
            no jittering.)
        clip: An iterable of two values (minimum and maximum) to clip
            the jittered values to, default to None.
        random_state: An integer or 1-d array value used to set the random
            seed, default to None.

    Raises:
        TypeError: If `column_name` is not numeric.
        ValueError: If `scale` is not a numerical value
            greater than `0`.
        ValueError: If `clip` is not an iterable of length `2`.
        ValueError: If `clip[0]` is greater than `clip[1]`.

    Returns:
        A pandas DataFrame with a new column containing
            Gaussian-jittered values from another column.
    """

    # Check types
    check("scale", scale, [int, float])

    # Check that `column_name` is a numeric column
    if not np.issubdtype(df[column_name].dtype, np.number):
        raise TypeError(f"{column_name} must be a numeric column.")

    if scale <= 0:
        raise ValueError("`scale` must be a numeric value greater than 0.")
    values = df[column_name]
    if random_state is not None:
        np.random.seed(random_state)
    result = np.random.normal(loc=values, scale=scale)
    if clip:
        # Ensure `clip` has length 2
        if len(clip) != 2:
            raise ValueError("`clip` must be an iterable of length 2.")
        # Ensure the values in `clip` are ordered as min, max
        if clip[1] < clip[0]:
            raise ValueError(
                "`clip[0]` must be less than or equal to `clip[1]`."
            )
        result = np.clip(result, *clip)
    df[dest_column_name] = result

    return df
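
The `clip` argument is documented above but not exercised in the examples. A sketch of how it is expected to behave, reusing the same seed as the docstring example so the pre-clipping draws are identical:

```python
import numpy as np
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"a": [3, 4, 5, np.nan]})

# Jittered values below 3 are raised to 3 and values above 5 lowered to 5;
# the NaN row remains NaN.
df = df.jitter(
    "a", dest_column_name="a_jit", scale=1, clip=(3, 5), random_state=42
)
```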

join_apply

Implementation of the join_apply function

join_apply(df, func, new_column_name)

Join the result of applying a function across dataframe rows.

This method does not mutate the original DataFrame.

This is a convenience function that allows us to apply arbitrary functions that take any combination of information from any of the columns. The only requirement is that the function signature takes in a row from the DataFrame.

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

Sum the result of two columns into a new column.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a":[1, 2, 3], "b": [2, 3, 4]})
>>> df
   a  b
0  1  2
1  2  3
2  3  4
>>> df.join_apply(
...     func=lambda x: 2 * x["a"] + x["b"],
...     new_column_name="2a+b",
... )
   a  b  2a+b
0  1  2     4
1  2  3     7
2  3  4    10

Incorporating conditionals in func.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [20, 30, 40]})
>>> df
   a   b
0  1  20
1  2  30
2  3  40
>>> def take_a_if_even(x):
...     if x["a"] % 2 == 0:
...         return x["a"]
...     else:
...         return x["b"]
>>> df.join_apply(take_a_if_even, "a_if_even")
   a   b  a_if_even
0  1  20         20
1  2  30          2
2  3  40         40

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `func` | `Callable` | A function that is applied elementwise across all rows of the DataFrame. | required |
| `new_column_name` | `str` | Name of the resulting column. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with new column appended. |

Source code in janitor/functions/join_apply.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.transform_columns` instead."
    )
)
def join_apply(
    df: pd.DataFrame,
    func: Callable,
    new_column_name: str,
) -> pd.DataFrame:
    """Join the result of applying a function across dataframe rows.

    This method does not mutate the original DataFrame.

    This is a convenience function that allows us to apply arbitrary functions
    that take any combination of information from any of the columns. The only
    requirement is that the function signature takes in a row from the
    DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use [`jn.transform_column`][janitor.functions.transform_columns.transform_column]
        instead.

    Examples:
        Sum the result of two columns into a new column.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a":[1, 2, 3], "b": [2, 3, 4]})
        >>> df
           a  b
        0  1  2
        1  2  3
        2  3  4
        >>> df.join_apply(
        ...     func=lambda x: 2 * x["a"] + x["b"],
        ...     new_column_name="2a+b",
        ... )
           a  b  2a+b
        0  1  2     4
        1  2  3     7
        2  3  4    10

        Incorporating conditionals in `func`.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [20, 30, 40]})
        >>> df
           a   b
        0  1  20
        1  2  30
        2  3  40
        >>> def take_a_if_even(x):
        ...     if x["a"] % 2 == 0:
        ...         return x["a"]
        ...     else:
        ...         return x["b"]
        >>> df.join_apply(take_a_if_even, "a_if_even")
           a   b  a_if_even
        0  1  20         20
        1  2  30          2
        2  3  40         40

    Args:
        df: A pandas DataFrame.
        func: A function that is applied elementwise across all rows of the
            DataFrame.
        new_column_name: Name of the resulting column.

    Returns:
        A pandas DataFrame with new column appended.
    """  # noqa: E501
    df = df.copy().join(df.apply(func, axis=1).rename(new_column_name))
    return df

label_encode

Implementation of label_encode function

label_encode(df, column_names)

Convert labels into numerical data.

This method will create a new column with the string _enc appended after the original column's name. Consider this to be syntactic sugar. This function uses the factorize pandas function under the hood.

This method behaves differently from encode_categorical. This method creates a new column of numeric data. encode_categorical replaces the dtype of the original column with a categorical dtype.

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use factorize_columns instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["b", "b", "a", "c", "b"],
...     "bar": range(4, 9),
... })
>>> df
  foo  bar
0   b    4
1   b    5
2   a    6
3   c    7
4   b    8
>>> df.label_encode(column_names="foo")
  foo  bar  foo_enc
0   b    4        0
1   b    5        0
2   a    6        1
3   c    7        2
4   b    8        0

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | The pandas DataFrame object. | required |
| `column_names` | `Union[str, Iterable[str], Hashable]` | A column name or an iterable (list or tuple) of column names. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame. |

Source code in janitor/functions/label_encode.py
@pf.register_dataframe_method
@deprecated_alias(columns="column_names")
def label_encode(
    df: pd.DataFrame,
    column_names: Union[str, Iterable[str], Hashable],
) -> pd.DataFrame:
    """Convert labels into numerical data.

    This method will create a new column with the string `_enc` appended
    after the original column's name.
    Consider this to be syntactic sugar.
    This function uses the `factorize` pandas function under the hood.

    This method behaves differently from
    [`encode_categorical`][janitor.functions.encode_categorical.encode_categorical].
    This method creates a new column of numeric data.
    [`encode_categorical`][janitor.functions.encode_categorical.encode_categorical]
    replaces the dtype of the original column with a *categorical* dtype.

    This method mutates the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use [`factorize_columns`][janitor.functions.factorize_columns.factorize_columns]
        instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "foo": ["b", "b", "a", "c", "b"],
        ...     "bar": range(4, 9),
        ... })
        >>> df
          foo  bar
        0   b    4
        1   b    5
        2   a    6
        3   c    7
        4   b    8
        >>> df.label_encode(column_names="foo")
          foo  bar  foo_enc
        0   b    4        0
        1   b    5        0
        2   a    6        1
        3   c    7        2
        4   b    8        0

    Args:
        df: The pandas DataFrame object.
        column_names: A column name or an iterable (list
            or tuple) of column names.

    Returns:
        A pandas DataFrame.
    """  # noqa: E501
    warnings.warn(
        "`label_encode` will be deprecated in a 1.x release. "
        "Please use `factorize_columns` instead."
    )
    df = _factorize(df, column_names, "_enc")
    return df
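
Since the docstring points to `factorize_columns` as the replacement, a sketch of the suggested call is shown below. It assumes `factorize_columns` accepts the same `column_names` argument and appends an `_enc` suffix by default, as the deprecation note implies; check the `factorize_columns` reference for the exact signature.

```python
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({
    "foo": ["b", "b", "a", "c", "b"],
    "bar": range(4, 9),
})

# Assumed equivalent of df.label_encode(column_names="foo"):
# adds a numeric "foo_enc" column next to the original "foo".
df = df.factorize_columns(column_names="foo")
```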

limit_column_characters

Implementation of limit_column_characters.

limit_column_characters(df, column_length, col_separator='_')

Truncate column sizes to a specific length.

This method mutates the original DataFrame.

All column names are truncated to the given length; duplicate names are then made unique by appending the separator character and an index, except for the first occurrence of each name.

Examples:

>>> import pandas as pd
>>> import janitor
>>> data_dict = {
...     "really_long_name": [9, 8, 7],
...     "another_really_long_name": [2, 4, 6],
...     "another_really_longer_name": list("xyz"),
...     "this_is_getting_out_of_hand": list("pqr"),
... }
>>> df = pd.DataFrame(data_dict)
>>> df
   really_long_name  another_really_long_name another_really_longer_name this_is_getting_out_of_hand
0                 9                         2                          x                           p
1                 8                         4                          y                           q
2                 7                         6                          z                           r
>>> df.limit_column_characters(7)
   really_  another another_1 this_is
0        9        2         x       p
1        8        4         y       q
2        7        6         z       r

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_length` | `int` | Character length for which to truncate all columns. The column separator and the numeric suffix for duplicate column names do not contribute. Therefore, if all columns are truncated to 10 characters, the first distinct column will be 10 characters and the remaining will be 12 characters (assuming a column separator of one character). | required |
| `col_separator` | `str` | The separator to use for counting distinct column values, for example, `'_'` or `'.'`. Supply an empty string (i.e. `''`) to remove the separator. | `'_'` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with truncated column lengths. |

Source code in janitor/functions/limit_column_characters.py
@pf.register_dataframe_method
def limit_column_characters(
    df: pd.DataFrame,
    column_length: int,
    col_separator: str = "_",
) -> pd.DataFrame:
    """Truncate column sizes to a specific length.

    This method mutates the original DataFrame.

    All column names are truncated to the given length; duplicate names are
    then made unique by appending the separator character and an index,
    except for the first occurrence of each name.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> data_dict = {
        ...     "really_long_name": [9, 8, 7],
        ...     "another_really_long_name": [2, 4, 6],
        ...     "another_really_longer_name": list("xyz"),
        ...     "this_is_getting_out_of_hand": list("pqr"),
        ... }
        >>> df = pd.DataFrame(data_dict)
        >>> df  # doctest: +SKIP
           really_long_name  another_really_long_name another_really_longer_name this_is_getting_out_of_hand
        0                 9                         2                          x                           p
        1                 8                         4                          y                           q
        2                 7                         6                          z                           r
        >>> df.limit_column_characters(7)
           really_  another another_1 this_is
        0        9        2         x       p
        1        8        4         y       q
        2        7        6         z       r

    Args:
        df: A pandas DataFrame.
        column_length: Character length for which to truncate all columns.
            The column separator and the numeric suffix for duplicate column
            names do not contribute. Therefore, if all columns are truncated to 10
            characters, the first distinct column will be 10 characters and the
            remaining will be 12 characters (assuming a column separator of one
            character).
        col_separator: The separator to use for counting distinct column
            values, for example, `'_'` or `'.'`.
            Supply an empty string (i.e. `''`) to remove the separator.

    Returns:
        A pandas DataFrame with truncated column lengths.
    """  # noqa: E501

    check("column_length", column_length, [int])
    check("col_separator", col_separator, [str])

    col_names = df.columns
    col_names = [col_name[:column_length] for col_name in col_names]

    col_name_set = set(col_names)
    col_name_count = {}

    # If no columns are duplicates, we can skip the loops below.
    if len(col_name_set) == len(col_names):
        df.columns = col_names
        return df

    for col_name_to_check in col_name_set:
        count = 0
        for idx, col_name in enumerate(col_names):
            if col_name_to_check == col_name:
                col_name_count[idx] = count
                count += 1

    final_col_names = []
    for idx, col_name in enumerate(col_names):
        if col_name_count[idx] > 0:
            col_name_to_append = (
                col_name + col_separator + str(col_name_count[idx])
            )
            final_col_names.append(col_name_to_append)
        else:
            final_col_names.append(col_name)

    df.columns = final_col_names
    return df
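
A small sketch of the `col_separator` option: using `'.'` instead of the default `'_'` only changes how duplicate truncated names are suffixed (the data are the same as in the docstring example):

```python
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({
    "really_long_name": [9, 8, 7],
    "another_really_long_name": [2, 4, 6],
    "another_really_longer_name": list("xyz"),
    "this_is_getting_out_of_hand": list("pqr"),
})

# Duplicate truncated names become "another" and "another.1"
# rather than "another" and "another_1".
df = df.limit_column_characters(7, col_separator=".")
```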

min_max_scale

min_max_scale(df, feature_range=(0, 1), column_name=None, jointly=False)

Scales DataFrame to between a minimum and maximum value.

One can optionally set a new target minimum and maximum value using the feature_range keyword argument.

If column_name is specified, then only those column(s) of data are scaled. Otherwise, the entire dataframe is scaled. If jointly is True, the selected columns (or the entire dataframe, when column_name is not given) are treated as a single dataset and scaled jointly. Otherwise, each column of data is scaled separately.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale()
     a    b
0  0.0  0.0
1  1.0  1.0
>>> df.min_max_scale(jointly=True)
     a    b
0  0.5  0.0
1  1.0  0.5

Setting custom minimum and maximum.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale(feature_range=(0, 100))
       a      b
0    0.0    0.0
1  100.0  100.0
>>> df.min_max_scale(feature_range=(0, 100), jointly=True)
       a     b
0   50.0   0.0
1  100.0  50.0

Apply min-max to the selected columns.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1], 'c': [1, 0]})
>>> df.min_max_scale(
...     feature_range=(0, 100),
...     column_name=["a", "c"],
... )
       a  b      c
0    0.0  0  100.0
1  100.0  1    0.0
>>> df.min_max_scale(
...     feature_range=(0, 100),
...     column_name=["a", "c"],
...     jointly=True,
... )
       a  b     c
0   50.0  0  50.0
1  100.0  1   0.0
>>> df.min_max_scale(feature_range=(0, 100), column_name='a')
       a  b  c
0    0.0  0  1
1  100.0  1  0

The aforementioned example might be applied to something like scaling the isoelectric points of amino acids. While technically they range from approx 3-10, we can also think of them on the pH scale which ranges from 1 to 14. Hence, 3 gets scaled not to 0 but approx. 0.15 instead, while 10 gets scaled to approx. 0.69 instead.

Version Changed

  • 0.24.0
    • Deleted old_min, old_max, new_min, and new_max options.
    • Added feature_range, and jointly options.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `feature_range` | `tuple[int \| float, int \| float]` | Desired range of transformed data. | `(0, 1)` |
| `column_name` | `str \| int \| list[str \| int] \| Index` | The column on which to perform scaling. | `None` |
| `jointly` | `bool` | Scale the entire data if True. | `False` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `feature_range` isn't tuple type. |
| `ValueError` | If the length of `feature_range` isn't equal to two. |
| `ValueError` | If the element of `feature_range` isn't number type. |
| `ValueError` | If `feature_range[1]` <= `feature_range[0]`. |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with scaled data. |

Source code in janitor/functions/min_max_scale.py
@pf.register_dataframe_method
@deprecated_kwargs(
    "old_min",
    "old_max",
    "new_min",
    "new_max",
    message=(
        "The keyword argument {argument!r} of {func_name!r} is deprecated. "
        "Please use 'feature_range' instead."
    ),
)
@deprecated_alias(col_name="column_name")
def min_max_scale(
    df: pd.DataFrame,
    feature_range: tuple[int | float, int | float] = (0, 1),
    column_name: str | int | list[str | int] | pd.Index = None,
    jointly: bool = False,
) -> pd.DataFrame:
    """Scales DataFrame to between a minimum and maximum value.

    One can optionally set a new target **minimum** and **maximum** value
    using the `feature_range` keyword argument.

    If `column_name` is specified, then only those column(s) of data are scaled.
    Otherwise, the entire dataframe is scaled.
    If `jointly` is `True`, the selected columns (or the entire dataframe,
    when `column_name` is not given) are treated as a single dataset
    and scaled jointly. Otherwise, each column of data is scaled separately.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
        >>> df.min_max_scale()
             a    b
        0  0.0  0.0
        1  1.0  1.0
        >>> df.min_max_scale(jointly=True)
             a    b
        0  0.5  0.0
        1  1.0  0.5

        Setting custom minimum and maximum.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
        >>> df.min_max_scale(feature_range=(0, 100))
               a      b
        0    0.0    0.0
        1  100.0  100.0
        >>> df.min_max_scale(feature_range=(0, 100), jointly=True)
               a     b
        0   50.0   0.0
        1  100.0  50.0

        Apply min-max to the selected columns.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1], 'c': [1, 0]})
        >>> df.min_max_scale(
        ...     feature_range=(0, 100),
        ...     column_name=["a", "c"],
        ... )
               a  b      c
        0    0.0  0  100.0
        1  100.0  1    0.0
        >>> df.min_max_scale(
        ...     feature_range=(0, 100),
        ...     column_name=["a", "c"],
        ...     jointly=True,
        ... )
               a  b     c
        0   50.0  0  50.0
        1  100.0  1   0.0
        >>> df.min_max_scale(feature_range=(0, 100), column_name='a')
               a  b  c
        0    0.0  0  1
        1  100.0  1  0

        The aforementioned example might be applied to something like scaling the
        isoelectric points of amino acids. While technically they range from
        approx 3-10, we can also think of them on the pH scale which ranges from
        1 to 14. Hence, 3 gets scaled not to 0 but approx. 0.15 instead, while 10
        gets scaled to approx. 0.69 instead.

    !!! summary "Version Changed"

        - 0.24.0
            - Deleted `old_min`, `old_max`, `new_min`, and `new_max` options.
            - Added `feature_range`, and `jointly` options.

    Args:
        df: A pandas DataFrame.
        feature_range: Desired range of transformed data.
        column_name: The column on which to perform scaling.
        jointly: Scale the entire data if True.

    Raises:
        ValueError: If `feature_range` isn't tuple type.
        ValueError: If the length of `feature_range` isn't equal to two.
        ValueError: If the element of `feature_range` isn't number type.
        ValueError: If `feature_range[1]` <= `feature_range[0]`.

    Returns:
        A pandas DataFrame with scaled data.
    """  # noqa: E501

    if not (
        isinstance(feature_range, (tuple, list))
        and len(feature_range) == 2
        and all((isinstance(i, (int, float))) for i in feature_range)
        and feature_range[1] > feature_range[0]
    ):
        raise ValueError(
            "`feature_range` should be a tuple or list of two numbers, "
            "where the second element is greater than the first one"
        )

    if column_name is not None:
        df = df.copy()  # Avoid to change the original DataFrame.

        old_feature_range = df[column_name].pipe(_min_max_value, jointly)
        df[column_name] = df[column_name].pipe(
            _apply_min_max,
            *old_feature_range,
            *feature_range,
        )
    else:
        old_feature_range = df.pipe(_min_max_value, jointly)
        df = df.pipe(
            _apply_min_max,
            *old_feature_range,
            *feature_range,
        )

    return df
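
For reference, the transformation follows the usual min-max formula. The sketch below reproduces the jointly scaled docstring example by hand; it is an illustration of the arithmetic, not the library's internal helpers:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0, 1]})
new_min, new_max = 0, 1

# Jointly: one old minimum/maximum is taken over all selected values...
old_min, old_max = df.min().min(), df.max().max()

# ...then each value x maps to
#   (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
scaled = (df - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
# -> a: [0.5, 1.0], b: [0.0, 0.5], matching df.min_max_scale(jointly=True)
```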

move

Implementation of move.

move(df, source, target=None, position='before', axis=0)

Changes the positions of rows or columns in the dataframe.

It uses the select syntax, making it easy to move blocks of rows or columns at once.

This operation does not reset the index of the dataframe. Users must do so explicitly if needed.

The dataframe must have unique column names or indices.

Examples:

Move a row:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")})
>>> df
   a  b
0  2  w
1  4  x
2  6  y
3  8  z
>>> df.move(source=0, target=3, position="before", axis=0)
   a  b
1  4  x
2  6  y
0  2  w
3  8  z

Move a column:

>>> import pandas as pd
>>> import janitor
>>> data = [{"a": 1, "b": 1, "c": 1,
...          "d": "a", "e": "a","f": "a"}]
>>> df = pd.DataFrame(data)
>>> df
   a  b  c  d  e  f
0  1  1  1  a  a  a
>>> df.move(source="a", target="c", position="after", axis=1)
   b  c  a  d  e  f
0  1  1  1  a  a  a
>>> df.move(source="f", target="b", position="before", axis=1)
   a  f  b  c  d  e
0  1  a  1  1  a  a
>>> df.move(source="a", target=None, position="after", axis=1)
   b  c  d  e  f  a
0  1  1  a  a  a  1

Move columns:

>>> from pandas.api.types import is_numeric_dtype, is_string_dtype
>>> df.move(source=is_string_dtype, target=None, position="before", axis=1)
   d  e  f  a  b  c
0  a  a  a  1  1  1
>>> df.move(source=is_numeric_dtype, target=None, position="after", axis=1)
   d  e  f  a  b  c
0  a  a  a  1  1  1
>>> df.move(source = ["d", "f"], target=is_numeric_dtype, position="before", axis=1)
   d  f  a  b  c  e
0  a  a  1  1  1  a
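
target=None works along the rows as well. A brief sketch (not taken from the library's docstring) that sends a block of rows to the end of the frame; as noted above, the index is not reset:

import pandas as pd
import janitor  # noqa: F401  # registers the `move` accessor

rows = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")})
# with target=None and position="after", the selected rows move to the end
rows.move(source=[0, 1], target=None, position="after", axis=0)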

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
source Any

Columns or rows to move.

required
target Any

Columns or rows to move adjacent to. If None and position == 'before', source is moved to the beginning; if position == 'after', source is moved to the end.

None
position str

Specifies the destination of the columns/rows. Values can be either before or after; defaults to before.

'before'
axis int

Axis along which the function is applied. 0 to move along the index, 1 to move along the columns.

0

Raises:

Type Description
ValueError

If axis is not 0 or 1.

ValueError

If position is not before or after.

Returns:

Type Description
DataFrame

The dataframe with the Series moved.

Source code in janitor/functions/move.py
@pf.register_dataframe_method
def move(
    df: pd.DataFrame,
    source: Any,
    target: Any = None,
    position: str = "before",
    axis: int = 0,
) -> pd.DataFrame:
    """Changes rows or columns positions in the dataframe.

    It uses the
    [`select`][janitor.functions.select.select] syntax,
    making it easy to move blocks of rows or columns at once.

    This operation does not reset the index of the dataframe. User must
    explicitly do so.

    The dataframe must have unique column names or indices.

    Examples:
        Move a row:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")})
        >>> df
           a  b
        0  2  w
        1  4  x
        2  6  y
        3  8  z
        >>> df.move(source=0, target=3, position="before", axis=0)
           a  b
        1  4  x
        2  6  y
        0  2  w
        3  8  z

        Move a column:
        >>> import pandas as pd
        >>> import janitor
        >>> data = [{"a": 1, "b": 1, "c": 1,
        ...          "d": "a", "e": "a","f": "a"}]
        >>> df = pd.DataFrame(data)
        >>> df
           a  b  c  d  e  f
        0  1  1  1  a  a  a
        >>> df.move(source="a", target="c", position="after", axis=1)
           b  c  a  d  e  f
        0  1  1  1  a  a  a
        >>> df.move(source="f", target="b", position="before", axis=1)
           a  f  b  c  d  e
        0  1  a  1  1  a  a
        >>> df.move(source="a", target=None, position="after", axis=1)
           b  c  d  e  f  a
        0  1  1  a  a  a  1

        Move columns:
        >>> from pandas.api.types import is_numeric_dtype, is_string_dtype
        >>> df.move(source=is_string_dtype, target=None, position="before", axis=1)
           d  e  f  a  b  c
        0  a  a  a  1  1  1
        >>> df.move(source=is_numeric_dtype, target=None, position="after", axis=1)
           d  e  f  a  b  c
        0  a  a  a  1  1  1
        >>> df.move(source = ["d", "f"], target=is_numeric_dtype, position="before", axis=1)
           d  f  a  b  c  e
        0  a  a  1  1  1  a

    Args:
        df: The pandas DataFrame object.
        source: Columns or rows to move.
        target: Columns or rows to move adjacent to.
            If `None` and `position == 'before'`, `source`
            is moved to the beginning; if `position == 'after'`,
            `source` is moved to the end.
        position: Specifies the destination of the columns/rows.
            Values can be either `before` or `after`; defaults to `before`.
        axis: Axis along which the function is applied. 0 to move along
            the index, 1 to move along the columns.

    Raises:
        ValueError: If `axis` is not `0` or `1`.
        ValueError: If `position` is not `before` or `after`.

    Returns:
        The dataframe with the Series moved.
    """  # noqa: E501
    if axis not in [0, 1]:
        raise ValueError(f"Invalid axis '{axis}'. Can only be 0 or 1.")

    if position not in ["before", "after"]:
        raise ValueError(
            f"Invalid position '{position}'. Can only be 'before' or 'after'."
        )

    mapping = {0: "index", 1: "columns"}
    names = getattr(df, mapping[axis])

    # labels along the chosen axis must be unique
    assert names.is_unique

    index = np.arange(names.size)
    # resolve `source` (select syntax) into positional indices
    source = _select_index([source], df, mapping[axis])
    source = _index_converter(source, index)
    if target is None:
        # no target: send `source` to the start ('before') or the end ('after')
        if position == "after":
            target = np.array([names.size])
        else:
            target = np.array([0])
    else:
        target = _select_index([target], df, mapping[axis])
        target = _index_converter(target, index)
    # drop the source positions before locating the insertion point
    index = np.delete(index, source)

    if position == "before":
        position = index.searchsorted(target[0])
    else:
        position = index.searchsorted(target[-1]) + 1
    start = index[:position]
    end = index[position:]
    # reassemble: everything before, the moved block, everything after
    position = np.concatenate([start, source, end])

    return df.iloc(axis=axis)[position]

pivot

pivot_longer(df, index=None, column_names=None, names_to=None, values_to='value', column_level=None, names_sep=None, names_pattern=None, names_transform=None, dropna=False, sort_by_appearance=False, ignore_index=True)

Unpivots a DataFrame from wide to long format.

This method does not mutate the original DataFrame.

It is modeled after the pivot_longer function in R's tidyr package, and also takes inspiration from R's data.table package.

This function is useful to massage a DataFrame into a format where one or more columns are considered measured variables, and all other columns are considered as identifier variables.

All measured variables are unpivoted (and typically duplicated) along the row axis.

Column selection in index and column_names is possible using the select syntax.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "Sepal.Length": [5.1, 5.9],
...         "Sepal.Width": [3.5, 3.0],
...         "Petal.Length": [1.4, 5.1],
...         "Petal.Width": [0.2, 1.8],
...         "Species": ["setosa", "virginica"],
...     }
... )
>>> df
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0           5.1          3.5           1.4          0.2     setosa
1           5.9          3.0           5.1          1.8  virginica

Replicate pandas' melt:

>>> df.pivot_longer(index = 'Species')
     Species      variable  value
0     setosa  Sepal.Length    5.1
1  virginica  Sepal.Length    5.9
2     setosa   Sepal.Width    3.5
3  virginica   Sepal.Width    3.0
4     setosa  Petal.Length    1.4
5  virginica  Petal.Length    5.1
6     setosa   Petal.Width    0.2
7  virginica   Petal.Width    1.8
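
For comparison, the same reshape can be done with plain pandas; a minimal sketch using pd.melt on the df defined above (row ordering may differ slightly):

# plain-pandas equivalent of df.pivot_longer(index='Species')
df.melt(id_vars="Species", var_name="variable", value_name="value")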

Convenient, flexible column selection in the index via the select syntax:

>>> from pandas.api.types import is_string_dtype
>>> df.pivot_longer(index = is_string_dtype)
     Species      variable  value
0     setosa  Sepal.Length    5.1
1  virginica  Sepal.Length    5.9
2     setosa   Sepal.Width    3.5
3  virginica   Sepal.Width    3.0
4     setosa  Petal.Length    1.4
5  virginica  Petal.Length    5.1
6     setosa   Petal.Width    0.2
7  virginica   Petal.Width    1.8

Split the column labels into parts:

>>> df.pivot_longer(
...     index = 'Species',
...     names_to = ('part', 'dimension'),
...     names_sep = '.',
...     sort_by_appearance = True,
... )
     Species   part dimension  value
0     setosa  Sepal    Length    5.1
1     setosa  Sepal     Width    3.5
2     setosa  Petal    Length    1.4
3     setosa  Petal     Width    0.2
4  virginica  Sepal    Length    5.9
5  virginica  Sepal     Width    3.0
6  virginica  Petal    Length    5.1
7  virginica  Petal     Width    1.8

Retain parts of the column names as headers:

>>> df.pivot_longer(
...     index = 'Species',
...     names_to = ('part', '.value'),
...     names_sep = '.',
...     sort_by_appearance = True,
... )
     Species   part  Length  Width
0     setosa  Sepal     5.1    3.5
1     setosa  Petal     1.4    0.2
2  virginica  Sepal     5.9    3.0
3  virginica  Petal     5.1    1.8

Split the column labels based on regex:

>>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
>>> df
   id  new_sp_m5564  newrel_f65
0   1             2           3
>>> df.pivot_longer(
...     index = 'id',
...     names_to = ('diagnosis', 'gender', 'age'),
...     names_pattern = r"new_?(.+)_(.)(\d+)",
... )
   id diagnosis gender   age  value
0   1        sp      m  5564      2
1   1       rel      f    65      3

Split the column labels for the above dataframe using named groups in names_pattern:

>>> df.pivot_longer(
...     index = 'id',
...     names_pattern = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)",
... )
    id diagnosis gender   age  value
0   1        sp      m  5564      2
1   1       rel      f    65      3

Convert the dtypes of specific columns with names_transform:

>>> result = (df
...          .pivot_longer(
...              index = 'id',
...              names_to = ('diagnosis', 'gender', 'age'),
...              names_pattern = r"new_?(.+)_(.)(\d+)",
...              names_transform = {'gender': 'category', 'age':'int'})
... )
>>> result.dtypes
id           int64
diagnosis   object
gender    category
age          int64
value        int64
dtype: object

Use multiple .value to reshape dataframe:

>>> df = pd.DataFrame(
...     [
...         {
...             "x_1_mean": 10,
...             "x_2_mean": 20,
...             "y_1_mean": 30,
...             "y_2_mean": 40,
...             "unit": 50,
...         }
...     ]
... )
>>> df
   x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
0        10        20        30        40    50
>>> df.pivot_longer(
...     index="unit",
...     names_to=(".value", "time", ".value"),
...     names_pattern=r"(x|y)_([0-9])(_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40

Replicate the above with named groups in names_pattern - use _ instead of .value:

>>> df.pivot_longer(
...     index="unit",
...     names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40

Convenient, flexible column selection in the column_names via select syntax:

>>> df.pivot_longer(
...     column_names="*mean",
...     names_to=(".value", "time", ".value"),
...     names_pattern=r"(x|y)_([0-9])(_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40
>>> df.pivot_longer(
...     column_names=slice("x_1_mean", "y_2_mean"),
...     names_to=(".value", "time", ".value"),
...     names_pattern=r"(x|y)_([0-9])(_mean)",
... )
   unit time  x_mean  y_mean
0    50    1      10      30
1    50    2      20      40

Reshape dataframe by passing a sequence to names_pattern:

>>> df = pd.DataFrame({'hr1': [514, 573],
...                    'hr2': [545, 526],
...                    'team': ['Red Sox', 'Yankees'],
...                    'year1': [2007, 2007],
...                    'year2': [2008, 2008]})
>>> df
   hr1  hr2     team  year1  year2
0  514  545  Red Sox   2007   2008
1  573  526  Yankees   2007   2008
>>> df.pivot_longer(
...     index = 'team',
...     names_to = ['year', 'hr'],
...     names_pattern = ['year', 'hr']
... )
      team   hr  year
0  Red Sox  514  2007
1  Yankees  573  2007
2  Red Sox  545  2008
3  Yankees  526  2008

Reshape above dataframe by passing a dictionary to names_pattern:

>>> df.pivot_longer(
...     index = 'team',
...     names_pattern = {"year":"year", "hr":"hr"}
... )
      team   hr  year
0  Red Sox  514  2007
1  Yankees  573  2007
2  Red Sox  545  2008
3  Yankees  526  2008

Multiple values_to:

>>> df = pd.DataFrame(
...         {
...             "City": ["Houston", "Austin", "Hoover"],
...             "State": ["Texas", "Texas", "Alabama"],
...             "Name": ["Aria", "Penelope", "Niko"],
...             "Mango": [4, 10, 90],
...             "Orange": [10, 8, 14],
...             "Watermelon": [40, 99, 43],
...             "Gin": [16, 200, 34],
...             "Vodka": [20, 33, 18],
...         },
...         columns=[
...             "City",
...             "State",
...             "Name",
...             "Mango",
...             "Orange",
...             "Watermelon",
...             "Gin",
...             "Vodka",
...         ],
...     )
>>> df
      City    State      Name  Mango  Orange  Watermelon  Gin  Vodka
0  Houston    Texas      Aria      4      10          40   16     20
1   Austin    Texas  Penelope     10       8          99  200     33
2   Hoover  Alabama      Niko     90      14          43   34     18
>>> df.pivot_longer(
...         index=["City", "State"],
...         column_names=slice("Mango", "Vodka"),
...         names_to=("Fruit", "Drink"),
...         values_to=("Pounds", "Ounces"),
...         names_pattern=["M|O|W", "G|V"],
...     )
      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40   None     NaN
7   Austin    Texas  Watermelon      99   None     NaN
8   Hoover  Alabama  Watermelon      43   None     NaN

Replicate the above transformation with a nested dictionary passed to names_pattern - the outer keys in the names_pattern dictionary are passed to names_to, while the inner keys are passed to values_to:

>>> df.pivot_longer(
...     index=["City", "State"],
...     column_names=slice("Mango", "Vodka"),
...     names_pattern={
...         "Fruit": {"Pounds": "M|O|W"},
...         "Drink": {"Ounces": "G|V"},
...     },
... )
      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40   None     NaN
7   Austin    Texas  Watermelon      99   None     NaN
8   Hoover  Alabama  Watermelon      43   None     NaN

Version Changed

  • 0.24.0
    • Added dropna parameter.
  • 0.24.1
    • names_pattern can accept a dictionary.
    • named groups supported in names_pattern.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
index Optional[Union[list, tuple, str, Pattern]]

Name(s) of columns to use as identifier variables. Should be either a single column name, or a list/tuple of column names. index should be a list of tuples if the columns are a MultiIndex.

None
column_names Optional[Union[list, tuple, str, Pattern]]

Name(s) of columns to unpivot. Should be either a single column name or a list/tuple of column names. column_names should be a list of tuples if the columns are a MultiIndex.

None
names_to Optional[Union[list, tuple, str]]

Name of new column as a string that will contain what were previously the column names in column_names. The default is variable if no value is provided. It can also be a list/tuple of strings that will serve as new column names, if names_sep or names_pattern is provided. If .value is in names_to, new column names will be extracted from part of the existing column names, and this overrides values_to.

None
values_to Optional[str]

Name of new column as a string that will contain what were previously the values of the columns in column_names. values_to can also be a list/tuple and requires that names_pattern is also a list/tuple.

'value'
column_level Optional[Union[int, str]]

If columns are a MultiIndex, then use this level to unpivot the DataFrame. Provided for compatibility with pandas' melt, and applies only if neither names_sep nor names_pattern is provided.

None
names_sep Optional[Union[str, Pattern]]

Determines how the column name is broken up, if names_to contains multiple values. It takes the same specification as pandas' str.split method, and can be a string or regular expression. names_sep does not work with MultiIndex columns.

None
names_pattern Optional[Union[list, tuple, str, Pattern]]

Determines how the column name is broken up. It can be a regular expression containing matching groups. Under the hood it is processed with pandas' str.extract function. If it is a single regex, the number of groups must match the length of names_to. Named groups are supported, if names_to is None. _ is used instead of .value as a placeholder in named groups. _ can be overloaded for multiple .value calls - _, __, ___, ... names_pattern can also be a list/tuple of regular expressions. It can also be a list/tuple of strings; the strings will be treated as regular expressions. Under the hood it is processed with pandas' str.contains function. For a list/tuple of regular expressions, names_to must also be a list/tuple and the lengths of both arguments must match. names_pattern can also be a dictionary, where the keys are the new column names, while the values can be a regular expression or a string which will be evaluated as a regular expression. Alternatively, a nested dictionary can be used, where the sub key(s) are associated with values_to. Please have a look at the examples for usage. names_pattern does not work with MultiIndex columns.

None
names_transform Optional[Union[str, Callable, dict]]

Use this option to change the types of columns that have been transformed to rows. This does not apply to the values columns. Accepts any argument that is acceptable by pd.astype.

None
dropna bool

Determines whether or not to drop nulls from the values columns. Default is False.

False
sort_by_appearance Optional[bool]

Boolean value that determines the final look of the DataFrame. If True, the unpivoted DataFrame will be stacked in order of first appearance.

False
ignore_index Optional[bool]

If True, the original index is ignored. If False, the original index is retained and the index labels will be repeated as necessary.

True

Returns:

Type Description
DataFrame

A pandas DataFrame that has been unpivoted from wide to long format.

Source code in janitor/functions/pivot.py
@pf.register_dataframe_method
def pivot_longer(
    df: pd.DataFrame,
    index: Optional[Union[list, tuple, str, Pattern]] = None,
    column_names: Optional[Union[list, tuple, str, Pattern]] = None,
    names_to: Optional[Union[list, tuple, str]] = None,
    values_to: Optional[str] = "value",
    column_level: Optional[Union[int, str]] = None,
    names_sep: Optional[Union[str, Pattern]] = None,
    names_pattern: Optional[Union[list, tuple, str, Pattern]] = None,
    names_transform: Optional[Union[str, Callable, dict]] = None,
    dropna: bool = False,
    sort_by_appearance: Optional[bool] = False,
    ignore_index: Optional[bool] = True,
) -> pd.DataFrame:
    """Unpivots a DataFrame from *wide* to *long* format.

    This method does not mutate the original DataFrame.

    It is modeled after the `pivot_longer` function in R's tidyr package,
    and also takes inspiration from R's data.table package.

    This function is useful to massage a DataFrame into a format where
    one or more columns are considered measured variables, and all other
    columns are considered as identifier variables.

    All measured variables are *unpivoted* (and typically duplicated) along the
    row axis.

    Column selection in `index` and `column_names` is possible using the
    [`select`][janitor.functions.select.select] syntax.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "Sepal.Length": [5.1, 5.9],
        ...         "Sepal.Width": [3.5, 3.0],
        ...         "Petal.Length": [1.4, 5.1],
        ...         "Petal.Width": [0.2, 1.8],
        ...         "Species": ["setosa", "virginica"],
        ...     }
        ... )
        >>> df
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
        0           5.1          3.5           1.4          0.2     setosa
        1           5.9          3.0           5.1          1.8  virginica

        Replicate pandas' melt:
        >>> df.pivot_longer(index = 'Species')
             Species      variable  value
        0     setosa  Sepal.Length    5.1
        1  virginica  Sepal.Length    5.9
        2     setosa   Sepal.Width    3.5
        3  virginica   Sepal.Width    3.0
        4     setosa  Petal.Length    1.4
        5  virginica  Petal.Length    5.1
        6     setosa   Petal.Width    0.2
        7  virginica   Petal.Width    1.8

        Convenient, flexible column selection in the `index` via the
        [`select`][janitor.functions.select.select] syntax:
        >>> from pandas.api.types import is_string_dtype
        >>> df.pivot_longer(index = is_string_dtype)
             Species      variable  value
        0     setosa  Sepal.Length    5.1
        1  virginica  Sepal.Length    5.9
        2     setosa   Sepal.Width    3.5
        3  virginica   Sepal.Width    3.0
        4     setosa  Petal.Length    1.4
        5  virginica  Petal.Length    5.1
        6     setosa   Petal.Width    0.2
        7  virginica   Petal.Width    1.8

        Split the column labels into parts:
        >>> df.pivot_longer(
        ...     index = 'Species',
        ...     names_to = ('part', 'dimension'),
        ...     names_sep = '.',
        ...     sort_by_appearance = True,
        ... )
             Species   part dimension  value
        0     setosa  Sepal    Length    5.1
        1     setosa  Sepal     Width    3.5
        2     setosa  Petal    Length    1.4
        3     setosa  Petal     Width    0.2
        4  virginica  Sepal    Length    5.9
        5  virginica  Sepal     Width    3.0
        6  virginica  Petal    Length    5.1
        7  virginica  Petal     Width    1.8

        Retain parts of the column names as headers:
        >>> df.pivot_longer(
        ...     index = 'Species',
        ...     names_to = ('part', '.value'),
        ...     names_sep = '.',
        ...     sort_by_appearance = True,
        ... )
             Species   part  Length  Width
        0     setosa  Sepal     5.1    3.5
        1     setosa  Petal     1.4    0.2
        2  virginica  Sepal     5.9    3.0
        3  virginica  Petal     5.1    1.8

        Split the column labels based on regex:
        >>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
        >>> df
           id  new_sp_m5564  newrel_f65
        0   1             2           3
        >>> df.pivot_longer(
        ...     index = 'id',
        ...     names_to = ('diagnosis', 'gender', 'age'),
        ...     names_pattern = r"new_?(.+)_(.)(\\d+)",
        ... )
           id diagnosis gender   age  value
        0   1        sp      m  5564      2
        1   1       rel      f    65      3

        Split the column labels for the above dataframe using named groups in `names_pattern`:
        >>> df.pivot_longer(
        ...     index = 'id',
        ...     names_pattern = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\\d+)",
        ... )
            id diagnosis gender   age  value
        0   1        sp      m  5564      2
        1   1       rel      f    65      3

        Convert the dtypes of specific columns with `names_transform`:
        >>> result = (df
        ...          .pivot_longer(
        ...              index = 'id',
        ...              names_to = ('diagnosis', 'gender', 'age'),
        ...              names_pattern = r"new_?(.+)_(.)(\\d+)",
        ...              names_transform = {'gender': 'category', 'age':'int'})
        ... )
        >>> result.dtypes
        id           int64
        diagnosis   object
        gender    category
        age          int64
        value        int64
        dtype: object

        Use multiple `.value` to reshape dataframe:
        >>> df = pd.DataFrame(
        ...     [
        ...         {
        ...             "x_1_mean": 10,
        ...             "x_2_mean": 20,
        ...             "y_1_mean": 30,
        ...             "y_2_mean": 40,
        ...             "unit": 50,
        ...         }
        ...     ]
        ... )
        >>> df
           x_1_mean  x_2_mean  y_1_mean  y_2_mean  unit
        0        10        20        30        40    50
        >>> df.pivot_longer(
        ...     index="unit",
        ...     names_to=(".value", "time", ".value"),
        ...     names_pattern=r"(x|y)_([0-9])(_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        Replicate the above with named groups in `names_pattern` - use `_` instead of `.value`:
        >>> df.pivot_longer(
        ...     index="unit",
        ...     names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        Convenient, flexible column selection in the `column_names` via
        [`select`][janitor.functions.select.select] syntax:
        >>> df.pivot_longer(
        ...     column_names="*mean",
        ...     names_to=(".value", "time", ".value"),
        ...     names_pattern=r"(x|y)_([0-9])(_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        >>> df.pivot_longer(
        ...     column_names=slice("x_1_mean", "y_2_mean"),
        ...     names_to=(".value", "time", ".value"),
        ...     names_pattern=r"(x|y)_([0-9])(_mean)",
        ... )
           unit time  x_mean  y_mean
        0    50    1      10      30
        1    50    2      20      40

        Reshape dataframe by passing a sequence to `names_pattern`:
        >>> df = pd.DataFrame({'hr1': [514, 573],
        ...                    'hr2': [545, 526],
        ...                    'team': ['Red Sox', 'Yankees'],
        ...                    'year1': [2007, 2007],
        ...                    'year2': [2008, 2008]})
        >>> df
           hr1  hr2     team  year1  year2
        0  514  545  Red Sox   2007   2008
        1  573  526  Yankees   2007   2008
        >>> df.pivot_longer(
        ...     index = 'team',
        ...     names_to = ['year', 'hr'],
        ...     names_pattern = ['year', 'hr']
        ... )
              team   hr  year
        0  Red Sox  514  2007
        1  Yankees  573  2007
        2  Red Sox  545  2008
        3  Yankees  526  2008


        Reshape above dataframe by passing a dictionary to `names_pattern`:
        >>> df.pivot_longer(
        ...     index = 'team',
        ...     names_pattern = {"year":"year", "hr":"hr"}
        ... )
              team   hr  year
        0  Red Sox  514  2007
        1  Yankees  573  2007
        2  Red Sox  545  2008
        3  Yankees  526  2008

        Multiple values_to:
        >>> df = pd.DataFrame(
        ...         {
        ...             "City": ["Houston", "Austin", "Hoover"],
        ...             "State": ["Texas", "Texas", "Alabama"],
        ...             "Name": ["Aria", "Penelope", "Niko"],
        ...             "Mango": [4, 10, 90],
        ...             "Orange": [10, 8, 14],
        ...             "Watermelon": [40, 99, 43],
        ...             "Gin": [16, 200, 34],
        ...             "Vodka": [20, 33, 18],
        ...         },
        ...         columns=[
        ...             "City",
        ...             "State",
        ...             "Name",
        ...             "Mango",
        ...             "Orange",
        ...             "Watermelon",
        ...             "Gin",
        ...             "Vodka",
        ...         ],
        ...     )
        >>> df
              City    State      Name  Mango  Orange  Watermelon  Gin  Vodka
        0  Houston    Texas      Aria      4      10          40   16     20
        1   Austin    Texas  Penelope     10       8          99  200     33
        2   Hoover  Alabama      Niko     90      14          43   34     18
        >>> df.pivot_longer(
        ...         index=["City", "State"],
        ...         column_names=slice("Mango", "Vodka"),
        ...         names_to=("Fruit", "Drink"),
        ...         values_to=("Pounds", "Ounces"),
        ...         names_pattern=["M|O|W", "G|V"],
        ...     )
              City    State       Fruit  Pounds  Drink  Ounces
        0  Houston    Texas       Mango       4    Gin    16.0
        1   Austin    Texas       Mango      10    Gin   200.0
        2   Hoover  Alabama       Mango      90    Gin    34.0
        3  Houston    Texas      Orange      10  Vodka    20.0
        4   Austin    Texas      Orange       8  Vodka    33.0
        5   Hoover  Alabama      Orange      14  Vodka    18.0
        6  Houston    Texas  Watermelon      40   None     NaN
        7   Austin    Texas  Watermelon      99   None     NaN
        8   Hoover  Alabama  Watermelon      43   None     NaN

        Replicate the above transformation with a nested dictionary passed to `names_pattern`
        - the outer keys in the `names_pattern` dictionary are passed to `names_to`,
        while the inner keys are passed to `values_to`:
        >>> df.pivot_longer(
        ...     index=["City", "State"],
        ...     column_names=slice("Mango", "Vodka"),
        ...     names_pattern={
        ...         "Fruit": {"Pounds": "M|O|W"},
        ...         "Drink": {"Ounces": "G|V"},
        ...     },
        ... )
              City    State       Fruit  Pounds  Drink  Ounces
        0  Houston    Texas       Mango       4    Gin    16.0
        1   Austin    Texas       Mango      10    Gin   200.0
        2   Hoover  Alabama       Mango      90    Gin    34.0
        3  Houston    Texas      Orange      10  Vodka    20.0
        4   Austin    Texas      Orange       8  Vodka    33.0
        5   Hoover  Alabama      Orange      14  Vodka    18.0
        6  Houston    Texas  Watermelon      40   None     NaN
        7   Austin    Texas  Watermelon      99   None     NaN
        8   Hoover  Alabama  Watermelon      43   None     NaN

    !!! abstract "Version Changed"

        - 0.24.0
            - Added `dropna` parameter.
        - 0.24.1
            - `names_pattern` can accept a dictionary.
            - named groups supported in `names_pattern`.

    Args:
        df: A pandas DataFrame.
        index: Name(s) of columns to use as identifier variables.
            Should be either a single column name, or a list/tuple of
            column names.
            `index` should be a list of tuples if the columns are a MultiIndex.
        column_names: Name(s) of columns to unpivot. Should be either
            a single column name or a list/tuple of column names.
            `column_names` should be a list of tuples
            if the columns are a MultiIndex.
        names_to: Name of new column as a string that will contain
            what were previously the column names in `column_names`.
            The default is `variable` if no value is provided. It can
            also be a list/tuple of strings that will serve as new column
            names, if `names_sep` or `names_pattern` is provided.
            If `.value` is in `names_to`, new column names will be extracted
            from part of the existing column names, and this overrides `values_to`.
        values_to: Name of new column as a string that will contain what
            were previously the values of the columns in `column_names`.
            values_to can also be a list/tuple
            and requires that names_pattern is also a list/tuple.
        column_level: If columns are a MultiIndex, then use this level to
            unpivot the DataFrame. Provided for compatibility with pandas' melt,
            and applies only if neither `names_sep` nor `names_pattern` is
            provided.
        names_sep: Determines how the column name is broken up, if
            `names_to` contains multiple values. It takes the same
            specification as pandas' `str.split` method, and can be a string
            or regular expression. `names_sep` does not work with MultiIndex
            columns.
        names_pattern: Determines how the column name is broken up.
            It can be a regular expression containing matching groups.
            Under the hood it is processed with pandas' `str.extract` function.
            If it is a single regex, the number of groups must match
            the length of `names_to`.
            Named groups are supported, if `names_to` is `None`. `_` is used
            instead of `.value` as a placeholder in named groups.
            `_` can be overloaded for multiple `.value`
            calls - `_`, `__`, `___`, ...
            `names_pattern` can also be a list/tuple of regular expressions.
            It can also be a list/tuple of strings;
            the strings will be treated as regular expressions.
            Under the hood it is processed with pandas' `str.contains` function.
            For a list/tuple of regular expressions,
            `names_to` must also be a list/tuple and the lengths of both
            arguments must match.
            `names_pattern` can also be a dictionary, where the keys are
            the new column names, while the values can be a regular expression
            or a string which will be evaluated as a regular expression.
            Alternatively, a nested dictionary can be used, where the sub
            key(s) are associated with `values_to`. Please have a look
            at the examples for usage.
            `names_pattern` does not work with MultiIndex columns.
        names_transform: Use this option to change the types of columns that
            have been transformed to rows. This does not apply to the values columns.
            Accepts any argument that is acceptable by `pd.astype`.
        dropna: Determines whether or not to drop nulls
            from the values columns. Default is `False`.
        sort_by_appearance: Boolean value that determines
            the final look of the DataFrame. If `True`, the unpivoted DataFrame
            will be stacked in order of first appearance.
        ignore_index: If `True`,
            the original index is ignored. If `False`, the original index
            is retained and the index labels will be repeated as necessary.

    Returns:
        A pandas DataFrame that has been unpivoted from wide to long
            format.
    """  # noqa: E501

    # this code builds on the wonderful work of @benjaminjack’s PR
    # https://github.com/benjaminjack/pyjanitor/commit/e3df817903c20dd21634461c8a92aec137963ed0

    return _computations_pivot_longer(
        df=df,
        index=index,
        column_names=column_names,
        column_level=column_level,
        names_to=names_to,
        values_to=values_to,
        names_sep=names_sep,
        names_pattern=names_pattern,
        names_transform=names_transform,
        dropna=dropna,
        sort_by_appearance=sort_by_appearance,
        ignore_index=ignore_index,
    )

pivot_wider(df, index=None, names_from=None, values_from=None, flatten_levels=True, names_sep='_', names_glue=None, reset_index=True, names_expand=False, index_expand=False)

Reshapes data from long to wide form.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.pivot instead.

The number of columns are increased, while decreasing the number of rows. It is the inverse of the pivot_longer method, and is a wrapper around pd.DataFrame.pivot method.

This method does not mutate the original DataFrame.

Column selection in index, names_from and values_from is possible using the select syntax.

A ValueError is raised if the combination of the index and names_from is not unique.

By default, values from values_from are always at the top level if the columns are not flattened. If flattened, the values from values_from are usually at the start of each label in the columns.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = [{'dep': 5.5, 'step': 1, 'a': 20, 'b': 30},
...       {'dep': 5.5, 'step': 2, 'a': 25, 'b': 37},
...       {'dep': 6.1, 'step': 1, 'a': 22, 'b': 19},
...       {'dep': 6.1, 'step': 2, 'a': 18, 'b': 29}]
>>> df = pd.DataFrame(df)
>>> df
   dep  step   a   b
0  5.5     1  20  30
1  5.5     2  25  37
2  6.1     1  22  19
3  6.1     2  18  29

Pivot and flatten columns:

>>> df.pivot_wider(
...     index = "dep",
...     names_from = "step",
... )
   dep  a_1  a_2  b_1  b_2
0  5.5   20   25   30   37
1  6.1   22   18   19   29
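
Since this function is slated for deprecation, here is a rough plain-pandas sketch of the same reshape (assuming, as required above, that the index/names_from combination is unique); the column flattening mimics the default names_sep='_', and out is just an illustrative variable name:

# pivot with pandas, then flatten the resulting MultiIndex columns
out = df.pivot(index="dep", columns="step")
out.columns = [f"{value}_{step}" for value, step in out.columns]
out = out.reset_index()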

Modify columns with names_sep:

>>> df.pivot_wider(
...     index = "dep",
...     names_from = "step",
...     names_sep = "",
... )
   dep  a1  a2  b1  b2
0  5.5  20  25  30  37
1  6.1  22  18  19  29

Modify columns with names_glue:

>>> df.pivot_wider(
...     index = "dep",
...     names_from = "step",
...     names_glue = "{_value}_step{step}",
... )
   dep  a_step1  a_step2  b_step1  b_step2
0  5.5       20       25       30       37
1  6.1       22       18       19       29

Expand columns to expose implicit missing values - this applies only to categorical columns:

>>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
>>> daily = pd.DataFrame(
...     {
...         "day": pd.Categorical(
...             values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays
...         ),
...         "value": (2, 3, 1, 5),
...     },
... index=[0, 0, 0, 0],
... )
>>> daily
   day  value
0  Tue      2
0  Thu      3
0  Fri      1
0  Mon      5
>>> daily.pivot_wider(names_from='day', values_from='value')
   Tue  Thu  Fri  Mon
0    2    3    1    5
>>> (daily
... .pivot_wider(
...     names_from='day',
...     values_from='value',
...     names_expand=True)
... )
   Mon  Tue  Wed  Thu  Fri  Sat  Sun
0    5    2  NaN    3    1  NaN  NaN

Expand the index to expose implicit missing values - this applies only to categorical columns:

>>> daily = daily.assign(letter = list('ABBA'))
>>> daily
   day  value letter
0  Tue      2      A
0  Thu      3      B
0  Fri      1      B
0  Mon      5      A
>>> daily.pivot_wider(index='day',names_from='letter',values_from='value')
   day    A    B
0  Tue  2.0  NaN
1  Thu  NaN  3.0
2  Fri  NaN  1.0
3  Mon  5.0  NaN
>>> (daily
... .pivot_wider(
...     index='day',
...     names_from='letter',
...     values_from='value',
...     index_expand=True)
... )
   day    A    B
0  Mon  5.0  NaN
1  Tue  2.0  NaN
2  Wed  NaN  NaN
3  Thu  NaN  3.0
4  Fri  NaN  1.0
5  Sat  NaN  NaN
6  Sun  NaN  NaN

Version Changed

  • 0.24.0
    • Added reset_index, names_expand and index_expand parameters.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
index Optional[Union[list, str]]

Name(s) of columns to use as identifier variables. It should be either a single column name, or a list of column names. If index is not provided, the DataFrame's index is used.

None
names_from Optional[Union[list, str]]

Name(s) of column(s) to use to make the new DataFrame's columns. Should be either a single column name, or a list of column names.

None
values_from Optional[Union[list, str]]

Name(s) of column(s) that will be used for populating the new DataFrame's values. If values_from is not specified, all remaining columns will be used.

None
flatten_levels Optional[bool]

If False, the DataFrame stays as a MultiIndex.

True
names_sep str

If names_from or values_from contain multiple variables, this will be used to join the values into a single string to use as a column name. Default is _. Applicable only if flatten_levels is True.

'_'
names_glue str

A string to control the output of the flattened columns. It offers more flexibility in creating custom column names, and uses python's str.format_map under the hood. Simply create the string template, using the column labels in names_from, and special _value as a placeholder for values_from. Applicable only if flatten_levels is True.

None
reset_index bool

Determines whether to restore index as a column/columns. Applicable only if index is provided, and flatten_levels is True.

True
names_expand bool

Expand columns to show all the categories. Applies only if names_from is a categorical column.

False
index_expand bool

Expand the index to show all the categories. Applies only if index is a categorical column.

False

Returns:

Type Description
DataFrame

A pandas DataFrame that has been unpivoted from long to wide form.

Source code in janitor/functions/pivot.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.pivot` instead."
    )
)
def pivot_wider(
    df: pd.DataFrame,
    index: Optional[Union[list, str]] = None,
    names_from: Optional[Union[list, str]] = None,
    values_from: Optional[Union[list, str]] = None,
    flatten_levels: Optional[bool] = True,
    names_sep: str = "_",
    names_glue: str = None,
    reset_index: bool = True,
    names_expand: bool = False,
    index_expand: bool = False,
) -> pd.DataFrame:
    """Reshapes data from *long* to *wide* form.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.pivot` instead.

    The number of columns are increased, while decreasing
    the number of rows. It is the inverse of the
    [`pivot_longer`][janitor.functions.pivot.pivot_longer]
    method, and is a wrapper around `pd.DataFrame.pivot` method.

    This method does not mutate the original DataFrame.

    Column selection in `index`, `names_from` and `values_from`
    is possible using the
    [`select`][janitor.functions.select.select] syntax.

    A ValueError is raised if the combination
    of the `index` and `names_from` is not unique.

    By default, values from `values_from` are always
    at the top level if the columns are not flattened.
    If flattened, the values from `values_from` are usually
    at the start of each label in the columns.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = [{'dep': 5.5, 'step': 1, 'a': 20, 'b': 30},
        ...       {'dep': 5.5, 'step': 2, 'a': 25, 'b': 37},
        ...       {'dep': 6.1, 'step': 1, 'a': 22, 'b': 19},
        ...       {'dep': 6.1, 'step': 2, 'a': 18, 'b': 29}]
        >>> df = pd.DataFrame(df)
        >>> df
           dep  step   a   b
        0  5.5     1  20  30
        1  5.5     2  25  37
        2  6.1     1  22  19
        3  6.1     2  18  29

        Pivot and flatten columns:
        >>> df.pivot_wider( # doctest: +SKIP
        ...     index = "dep",
        ...     names_from = "step",
        ... )
           dep  a_1  a_2  b_1  b_2
        0  5.5   20   25   30   37
        1  6.1   22   18   19   29

        Modify columns with `names_sep`:
        >>> df.pivot_wider( # doctest: +SKIP
        ...     index = "dep",
        ...     names_from = "step",
        ...     names_sep = "",
        ... )
           dep  a1  a2  b1  b2
        0  5.5  20  25  30  37
        1  6.1  22  18  19  29

        Modify columns with `names_glue`:
        >>> df.pivot_wider( # doctest: +SKIP
        ...     index = "dep",
        ...     names_from = "step",
        ...     names_glue = "{_value}_step{step}",
        ... )
           dep  a_step1  a_step2  b_step1  b_step2
        0  5.5       20       25       30       37
        1  6.1       22       18       19       29

        Expand columns to expose implicit missing values
        - this applies only to categorical columns:
        >>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
        >>> daily = pd.DataFrame(
        ...     {
        ...         "day": pd.Categorical(
        ...             values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays
        ...         ),
        ...         "value": (2, 3, 1, 5),
        ...     },
        ... index=[0, 0, 0, 0],
        ... )
        >>> daily
           day  value
        0  Tue      2
        0  Thu      3
        0  Fri      1
        0  Mon      5
        >>> daily.pivot_wider(names_from='day', values_from='value') # doctest: +SKIP
           Tue  Thu  Fri  Mon
        0    2    3    1    5
        >>> (daily # doctest: +SKIP
        ... .pivot_wider(
        ...     names_from='day',
        ...     values_from='value',
        ...     names_expand=True)
        ... )
           Mon  Tue  Wed  Thu  Fri  Sat  Sun
        0    5    2  NaN    3    1  NaN  NaN

        Expand the index to expose implicit missing values
        - this applies only to categorical columns:
        >>> daily = daily.assign(letter = list('ABBA'))
        >>> daily
           day  value letter
        0  Tue      2      A
        0  Thu      3      B
        0  Fri      1      B
        0  Mon      5      A
        >>> daily.pivot_wider(index='day',names_from='letter',values_from='value') # doctest: +SKIP
           day    A    B
        0  Tue  2.0  NaN
        1  Thu  NaN  3.0
        2  Fri  NaN  1.0
        3  Mon  5.0  NaN
        >>> (daily # doctest: +SKIP
        ... .pivot_wider(
        ...     index='day',
        ...     names_from='letter',
        ...     values_from='value',
        ...     index_expand=True)
        ... )
           day    A    B
        0  Mon  5.0  NaN
        1  Tue  2.0  NaN
        2  Wed  NaN  NaN
        3  Thu  NaN  3.0
        4  Fri  NaN  1.0
        5  Sat  NaN  NaN
        6  Sun  NaN  NaN


    !!! abstract "Version Changed"

        - 0.24.0
            - Added `reset_index`, `names_expand` and `index_expand` parameters.

    Args:
        df: A pandas DataFrame.
        index: Name(s) of columns to use as identifier variables.
            It should be either a single column name, or a list of column names.
            If `index` is not provided, the DataFrame's index is used.
        names_from: Name(s) of column(s) to use to make the new
            DataFrame's columns. Should be either a single column name,
            or a list of column names.
        values_from: Name(s) of column(s) that will be used for populating
            the new DataFrame's values.
            If `values_from` is not specified, all remaining columns
            will be used.
        flatten_levels: If `False`, the DataFrame stays as a MultiIndex.
        names_sep: If `names_from` or `values_from` contain multiple
            variables, this will be used to join the values into a single string
            to use as a column name. Default is `_`.
            Applicable only if `flatten_levels` is `True`.
        names_glue: A string to control the output of the flattened columns.
            It offers more flexibility in creating custom column names,
            and uses python's `str.format_map` under the hood.
            Simply create the string template,
            using the column labels in `names_from`,
            and special `_value` as a placeholder for `values_from`.
            Applicable only if `flatten_levels` is `True`.
        reset_index: Determines whether to restore `index`
            as a column/columns. Applicable only if `index` is provided,
            and `flatten_levels` is `True`.
        names_expand: Expand columns to show all the categories.
            Applies only if `names_from` is a categorical column.
        index_expand: Expand the index to show all the categories.
            Applies only if `index` is a categorical column.

    Returns:
        A pandas DataFrame that has been unpivoted from long to wide form.
    """  # noqa: E501

    # no need for an explicit copy --> df = df.copy()
    # `pd.pivot` creates one
    return _computations_pivot_wider(
        df,
        index,
        names_from,
        values_from,
        flatten_levels,
        names_sep,
        names_glue,
        reset_index,
        names_expand,
        index_expand,
    )

process_text

Implementation source for process_text.

process_text(df, column_name, string_function, **kwargs)

Apply a Pandas string method to an existing column.

This function aims to make string cleaning easy, while chaining, by simply passing the string method name, along with keyword arguments, if any, to the function.

This modifies an existing column; it does not create a new column. New columns can be created via pyjanitor's transform_columns.

A list of all the string methods in Pandas can be accessed in the pandas documentation: https://pandas.pydata.org/docs/user_guide/text.html#method-summary.

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> import re
>>> df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
... "code": [1, 2, 3]})
>>> df
         text  code
0      Ragnar     1
1  sammywemmy     2
2      ginger     3
>>> df.process_text(column_name="text", string_function="lower")
         text  code
0      ragnar     1
1  sammywemmy     2
2      ginger     3

For string methods with parameters, simply pass the keyword arguments:

>>> df.process_text(
...     column_name="text",
...     string_function="extract",
...     pat=r"(ag)",
...     expand=False,
...     flags=re.IGNORECASE,
... )
  text  code
0   ag     1
1  NaN     2
2  NaN     3
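
Because process_text simply assigns the result of the string method back to the column (see the source below), the first example reduces to a plain-pandas one-liner:

# plain-pandas equivalent of df.process_text(column_name="text", string_function="lower")
df.assign(text=df["text"].str.lower())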

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name str

String column to be operated on.

required
string_function str

pandas string method to be applied.

required
**kwargs Any

Keyword arguments for parameters of the string_function.

{}

Raises:

Type Description
KeyError

If string_function is not a Pandas string method.

ValueError

If the text function returns a DataFrame, instead of a Series.

Returns:

Type Description
DataFrame

A pandas DataFrame with modified column.

Source code in janitor/functions/process_text.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.transform_columns` instead."
    )
)
@deprecated_alias(column="column_name")
def process_text(
    df: pd.DataFrame,
    column_name: str,
    string_function: str,
    **kwargs: Any,
) -> pd.DataFrame:
    """Apply a Pandas string method to an existing column.

    This function aims to make string cleaning easy, while chaining,
    by simply passing the string method name,
    along with keyword arguments, if any, to the function.

    This modifies an existing column; it does not create a new column;
    new columns can be created via pyjanitor's
    [`transform_columns`][janitor.functions.transform_columns.transform_columns].

    A list of all the string methods in Pandas can be accessed [here](https://pandas.pydata.org/docs/user_guide/text.html#method-summary).

    !!!note

        This function will be deprecated in a 1.x release.
        Please use [`jn.transform_column`][janitor.functions.transform_columns.transform_column]
        instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> import re
        >>> df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
        ... "code": [1, 2, 3]})
        >>> df
                 text  code
        0      Ragnar     1
        1  sammywemmy     2
        2      ginger     3
        >>> df.process_text(column_name="text", string_function="lower")
                 text  code
        0      ragnar     1
        1  sammywemmy     2
        2      ginger     3

        For string methods with parameters, simply pass the keyword arguments:

        >>> df.process_text(
        ...     column_name="text",
        ...     string_function="extract",
        ...     pat=r"(ag)",
        ...     expand=False,
        ...     flags=re.IGNORECASE,
        ... )
          text  code
        0   ag     1
        1  NaN     2
        2  NaN     3

    Args:
        df: A pandas DataFrame.
        column_name: String column to be operated on.
        string_function: pandas string method to be applied.
        **kwargs: Keyword arguments for parameters of the `string_function`.

    Raises:
        KeyError: If `string_function` is not a Pandas string method.
        ValueError: If the text function returns a DataFrame, instead of a Series.

    Returns:
        A pandas DataFrame with modified column.
    """  # noqa: E501

    check("column_name", column_name, [str])
    check("string_function", string_function, [str])
    check_column(df, [column_name])

    pandas_string_methods = [
        func.__name__
        for _, func in inspect.getmembers(pd.Series.str, inspect.isfunction)
        if not func.__name__.startswith("_")
    ]

    if string_function not in pandas_string_methods:
        raise KeyError(f"{string_function} is not a Pandas string method.")

    result = getattr(df[column_name].str, string_function)(**kwargs)

    if isinstance(result, pd.DataFrame):
        raise ValueError(
            "The outcome of the processed text is a DataFrame, "
            "which is not supported in `process_text`."
        )

    return df.assign(**{column_name: result})
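
As the note above says, `process_text` is headed for deprecation in favour of `transform_column`. A minimal migration sketch, assuming the example `df` from the docstring and `transform_column`'s `elementwise=False` mode (which passes the whole Series to the function):

import re

import pandas as pd
import janitor  # noqa: F401  (registers the DataFrame methods)

df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"], "code": [1, 2, 3]})

# df.process_text(column_name="text", string_function="lower") becomes:
lowered = df.transform_column("text", lambda s: s.str.lower(), elementwise=False)

# the keyword-argument form translates the same way, e.g. the `extract` example:
extracted = df.transform_column(
    "text",
    lambda s: s.str.extract(r"(ag)", expand=False, flags=re.IGNORECASE),
    elementwise=False,
)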

remove_columns

Implementation of remove_columns.

remove_columns(df, column_names)

Remove the set of columns specified in column_names.

This method does not mutate the original DataFrame.

Intended to be the method-chaining alternative to del df[col].

Note

This function will be deprecated in a 1.x release. Kindly use pd.DataFrame.drop instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]})
>>> df
   a  b  c
0  2  1  7
1  4  3  8
2  6  5  9
>>> df.remove_columns(column_names=['a', 'c'])
   b
0  1
1  3
2  5

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[str, Iterable[str], Hashable]

The columns to remove.

required

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/remove_columns.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.drop` instead."
    )
)
@deprecated_alias(columns="column_names")
def remove_columns(
    df: pd.DataFrame,
    column_names: Union[str, Iterable[str], Hashable],
) -> pd.DataFrame:
    """Remove the set of columns specified in `column_names`.

    This method does not mutate the original DataFrame.

    Intended to be the method-chaining alternative to `del df[col]`.

    !!!note

        This function will be deprecated in a 1.x release.
        Kindly use `pd.DataFrame.drop` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]})
        >>> df
           a  b  c
        0  2  1  7
        1  4  3  8
        2  6  5  9
        >>> df.remove_columns(column_names=['a', 'c'])
           b
        0  1
        1  3
        2  5

    Args:
        df: A pandas DataFrame.
        column_names: The columns to remove.

    Returns:
        A pandas DataFrame.
    """

    return df.drop(columns=column_names)
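
Since the body is a one-line wrapper around `pd.DataFrame.drop`, migrating is mechanical; a sketch using the example frame above:

import pandas as pd

df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]})

# df.remove_columns(column_names=['a', 'c']) becomes plain pandas:
out = df.drop(columns=["a", "c"])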

remove_empty

Implementation of remove_empty.

remove_empty(df, reset_index=True)

Drop all rows and columns that are completely null.

This method does not mutate the original DataFrame.

Implementation is inspired from StackOverflow.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [1, np.nan, 2],
...     "b": [3, np.nan, 4],
...     "c": [np.nan, np.nan, np.nan],
... })
>>> df
     a    b   c
0  1.0  3.0 NaN
1  NaN  NaN NaN
2  2.0  4.0 NaN
>>> df.remove_empty()
     a    b
0  1.0  3.0
1  2.0  4.0

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
reset_index bool

Determines if the index is reset.

True

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/remove_empty.py
@pf.register_dataframe_method
def remove_empty(df: pd.DataFrame, reset_index: bool = True) -> pd.DataFrame:
    """Drop all rows and columns that are completely null.

    This method does not mutate the original DataFrame.

    Implementation is inspired from [StackOverflow][so].

    [so]: https://stackoverflow.com/questions/38884538/python-pandas-find-all-rows-where-all-values-are-nan

    Examples:
        >>> import numpy as np
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": [1, np.nan, 2],
        ...     "b": [3, np.nan, 4],
        ...     "c": [np.nan, np.nan, np.nan],
        ... })
        >>> df
             a    b   c
        0  1.0  3.0 NaN
        1  NaN  NaN NaN
        2  2.0  4.0 NaN
        >>> df.remove_empty()
             a    b
        0  1.0  3.0
        1  2.0  4.0

    Args:
        df: The pandas DataFrame object.
        reset_index: Determines if the index is reset.

    Returns:
        A pandas DataFrame.
    """  # noqa: E501
    outcome = df.isna()
    outcome = df.loc[~outcome.all(axis=1), ~outcome.all(axis=0)]
    if reset_index:
        return outcome.reset_index(drop=True)
    return outcome
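
For reference, the same result can usually be reached with two chained `dropna` calls, since a column that is entirely null stays entirely null after the all-null rows are removed; a plain-pandas sketch against the example frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, 2],
    "b": [3, np.nan, 4],
    "c": [np.nan, np.nan, np.nan],
})

# drop rows that are completely null, then columns that are completely null
out = (
    df.dropna(axis="index", how="all")
    .dropna(axis="columns", how="all")
    .reset_index(drop=True)
)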

rename_columns

rename_column(df, old_column_name, new_column_name)

Rename a column.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.rename instead.

This is syntactic sugar/a convenience function for renaming one column at a time. If you need to rename multiple columns, use the pandas.DataFrame.rename method.

Examples:

Change the name of column 'a' to 'a_new'.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.rename_column(old_column_name='a', new_column_name='a_new')
   a_new  b
0      0  a
1      1  b
2      2  c

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
old_column_name str

The old column name.

required
new_column_name str

The new column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with renamed columns.

Source code in janitor/functions/rename_columns.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.rename` instead."
    )
)
@deprecated_alias(old="old_column_name", new="new_column_name")
def rename_column(
    df: pd.DataFrame,
    old_column_name: str,
    new_column_name: str,
) -> pd.DataFrame:
    """Rename a column in place.

    This method does not mutate the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.rename` instead.

    This is syntactic sugar/a convenience function for renaming one column at a time.
    If you need to rename multiple columns,
    use the `pandas.DataFrame.rename` method.

    Examples:
        Change the name of column 'a' to 'a_new'.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
        >>> df.rename_column(old_column_name='a', new_column_name='a_new')
           a_new  b
        0      0  a
        1      1  b
        2      2  c

    Args:
        df: The pandas DataFrame object.
        old_column_name: The old column name.
        new_column_name: The new column name.

    Returns:
        A pandas DataFrame with renamed columns.
    """  # noqa: E501

    check_column(df, [old_column_name])

    return df.rename(columns={old_column_name: new_column_name})

rename_columns(df, new_column_names=None, function=None)

Rename columns.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.rename instead.

Either new_column_names or function is required. If both are provided, new_column_names takes priority and function is never executed.

Examples:

Rename columns using a dictionary which maps old names to new names.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
>>> df
   a  b
0  0  x
1  1  y
2  2  z
>>> df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"})
   a_new b_new
0      0     x
1      1     y
2      2     z

Rename columns using a generic callable.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
>>> df.rename_columns(function=str.upper)
   A  B
0  0  x
1  1  y
2  2  z

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
new_column_names Union[Dict, None]

A dictionary of old and new column names.

None
function Callable

A function which should be applied to all the columns.

None

Raises:

Type Description
ValueError

If both new_column_names and function are None.

Returns:

Type Description
DataFrame

A pandas DataFrame with renamed columns.

Source code in janitor/functions/rename_columns.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `pd.DataFrame.rename` instead."
    )
)
def rename_columns(
    df: pd.DataFrame,
    new_column_names: Union[Dict, None] = None,
    function: Callable = None,
) -> pd.DataFrame:
    """Rename columns.

    This method does not mutate the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.rename` instead.

    Either `new_column_names` or `function` is required.
    If both are provided, `new_column_names` takes priority
    and `function` is never executed.

    Examples:
        Rename columns using a dictionary which maps old names to new names.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
        >>> df
           a  b
        0  0  x
        1  1  y
        2  2  z
        >>> df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"})
           a_new b_new
        0      0     x
        1      1     y
        2      2     z

        Rename columns using a generic callable.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
        >>> df.rename_columns(function=str.upper)
           A  B
        0  0  x
        1  1  y
        2  2  z

    Args:
        df: The pandas DataFrame object.
        new_column_names: A dictionary of old and new column names.
        function: A function which should be applied to all the columns.

    Raises:
        ValueError: If both `new_column_names` and `function` are None.

    Returns:
        A pandas DataFrame with renamed columns.
    """  # noqa: E501

    if new_column_names is None and function is None:
        raise ValueError(
            "One of new_column_names or function must be provided"
        )

    if new_column_names is not None:
        check_column(df, new_column_names)
        return df.rename(columns=new_column_names)

    return df.rename(mapper=function, axis="columns")
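
Both code paths map directly onto `pd.DataFrame.rename`, which is the suggested replacement; a migration sketch using the example frame above:

import pandas as pd

df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})

# df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"}) becomes:
renamed = df.rename(columns={"a": "a_new", "b": "b_new"})

# df.rename_columns(function=str.upper) becomes:
uppered = df.rename(mapper=str.upper, axis="columns")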

reorder_columns

Implementation source for reorder_columns.

reorder_columns(df, column_order)

Reorder DataFrame columns by specifying desired order as list of col names.

Columns not specified retain their order and follow after the columns specified in column_order.

All columns specified within the column_order list must be present within df.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]})
>>> df
   col1  col2  col3
0     1     2     3
1     1     2     3
2     1     2     3
>>> df.reorder_columns(['col3', 'col1'])
   col3  col1  col2
0     3     1     2
1     3     1     2
2     3     1     2

Notice that the column order of df is now col3, col1, col2.

Internally, this function uses DataFrame.reindex with copy=False to avoid unnecessary data duplication.

Parameters:

Name Type Description Default
df DataFrame

DataFrame to reorder

required
column_order Union[Iterable[str], Index, Hashable]

A list of column names or Pandas Index specifying their order in the returned DataFrame.

required

Raises:

Type Description
IndexError

If a column within column_order is not found within the DataFrame.

Returns:

Type Description
DataFrame

A pandas DataFrame with reordered columns.

Source code in janitor/functions/reorder_columns.py
@pf.register_dataframe_method
def reorder_columns(
    df: pd.DataFrame, column_order: Union[Iterable[str], pd.Index, Hashable]
) -> pd.DataFrame:
    """Reorder DataFrame columns by specifying desired order as list of col names.

    Columns not specified retain their order and follow after the columns specified
    in `column_order`.

    All columns specified within the `column_order` list must be present within `df`.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]})
        >>> df
           col1  col2  col3
        0     1     2     3
        1     1     2     3
        2     1     2     3
        >>> df.reorder_columns(['col3', 'col1'])
           col3  col1  col2
        0     3     1     2
        1     3     1     2
        2     3     1     2

        Notice that the column order of `df` is now `col3`, `col1`, `col2`.

    Internally, this function uses `DataFrame.reindex` with `copy=False`
    to avoid unnecessary data duplication.

    Args:
        df: `DataFrame` to reorder
        column_order: A list of column names or Pandas `Index`
            specifying their order in the returned `DataFrame`.

    Raises:
        IndexError: If a column within `column_order` is not found
            within the DataFrame.

    Returns:
        A pandas DataFrame with reordered columns.
    """  # noqa: E501
    check("column_order", column_order, [list, tuple, pd.Index])

    if any(col not in df.columns for col in column_order):
        raise IndexError(
            "One or more columns in `column_order` were not found in the "
            "DataFrame."
        )

    # if column_order is a Pandas index, needs conversion to list:
    column_order = list(column_order)

    return df.reindex(
        columns=(
            column_order
            + [col for col in df.columns if col not in column_order]
        ),
        copy=False,
    )
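
The reindex call described above is easy to reproduce by hand; a plain-pandas sketch of the same reordering, using the example frame:

import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]})

front = ["col3", "col1"]
# requested columns first, then the remaining columns in their original order
ordered = df.reindex(columns=front + [c for c in df.columns if c not in front])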

round_to_fraction

Implementation of round_to_fraction

round_to_fraction(df, column_name, denominator, digits=np.inf)

Round all values in a column to a fraction.

This method mutates the original DataFrame.

Taken from the R janitor package.

Also, optionally round to a specified number of digits.

Examples:

Round numeric column to the nearest 1/4 value.

>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a1": [1.263, 2.499, np.nan],
...     "a2": ["x", "y", "z"],
... })
>>> df
      a1 a2
0  1.263  x
1  2.499  y
2    NaN  z
>>> df.round_to_fraction("a1", denominator=4)
     a1 a2
0  1.25  x
1  2.50  y
2   NaN  z

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Name of column to round to fraction.

required
denominator float

The denominator of the fraction for rounding. Must be a positive number.

required
digits float

The number of digits for rounding after rounding to the fraction. Default is np.inf (i.e. no subsequent rounding).

inf

Raises:

Type Description
ValueError

If denominator is not a positive number.

Returns:

Type Description
DataFrame

A pandas DataFrame with a column's values rounded.

Source code in janitor/functions/round_to_fraction.py
@pf.register_dataframe_method
@deprecated_alias(col_name="column_name")
def round_to_fraction(
    df: pd.DataFrame,
    column_name: Hashable,
    denominator: float,
    digits: float = np.inf,
) -> pd.DataFrame:
    """Round all values in a column to a fraction.

    This method mutates the original DataFrame.

    Taken from [the R janitor package](https://github.com/sfirke/janitor/issues/235).

    Also, optionally round to a specified number of digits.

    Examples:
        Round numeric column to the nearest 1/4 value.

        >>> import numpy as np
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a1": [1.263, 2.499, np.nan],
        ...     "a2": ["x", "y", "z"],
        ... })
        >>> df
              a1 a2
        0  1.263  x
        1  2.499  y
        2    NaN  z
        >>> df.round_to_fraction("a1", denominator=4)
             a1 a2
        0  1.25  x
        1  2.50  y
        2   NaN  z

    Args:
        df: A pandas DataFrame.
        column_name: Name of column to round to fraction.
        denominator: The denominator of the fraction for rounding. Must be
            a positive number.
        digits: The number of digits for rounding after rounding to the
            fraction. Default is np.inf (i.e. no subsequent rounding).

    Raises:
        ValueError: If `denominator` is not a positive number.

    Returns:
        A pandas DataFrame with a column's values rounded.
    """
    check_column(df, column_name)
    check("denominator", denominator, [float, int])
    check("digits", digits, [float, int])

    if denominator <= 0:
        raise ValueError("denominator is expected to be a positive number.")

    df[column_name] = round(df[column_name] * denominator, 0) / denominator
    if not np.isinf(digits):
        df[column_name] = round(df[column_name], digits)

    return df
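
The core arithmetic is simply round(value * denominator) / denominator, with an optional second rounding step. A standalone sketch on a single Series, using a hypothetical denominator of 8 and 2 digits:

import numpy as np
import pandas as pd

s = pd.Series([1.263, 2.499, np.nan])
denominator = 8  # round to the nearest 1/8
digits = 2       # optional follow-up rounding

rounded = (s * denominator).round(0) / denominator
if not np.isinf(digits):
    rounded = rounded.round(digits)
# 1.263 -> 1.25, 2.499 -> 2.5, NaN stays NaN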

row_to_names

Implementation of the row_to_names function.

row_to_names(df, row_numbers=0, remove_rows=False, remove_rows_above=False, reset_index=False)

Elevates a row, or rows, to be the column names of a DataFrame.

This method does not mutate the original DataFrame.

Contains options to remove the elevated row from the DataFrame along with removing the rows above the selected row.

Examples:

Replace column names with the first row and reset the index.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": ["nums", 6, 9],
...     "b": ["chars", "x", "y"],
... })
>>> df
      a      b
0  nums  chars
1     6      x
2     9      y
>>> df.row_to_names(0, remove_rows=True, reset_index=True)
  nums chars
0    6     x
1    9     y
>>> df.row_to_names([0,1], remove_rows=True, reset_index=True)
  nums chars
     6     x
0    9     y

Remove rows above the elevated row and the elevated row itself.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": ["bla1", "nums", 6, 9],
...     "b": ["bla2", "chars", "x", "y"],
... })
>>> df
      a      b
0  bla1   bla2
1  nums  chars
2     6      x
3     9      y
>>> df.row_to_names(1, remove_rows=True, remove_rows_above=True, reset_index=True)
  nums chars
0    6     x
1    9     y

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
row_numbers int

Position of the row(s) containing the variable names. Note that indexing starts from 0. It can also be a list, in which case, a MultiIndex column is created. Defaults to 0 (first row).

0
remove_rows bool

Whether the row(s) should be removed from the DataFrame.

False
remove_rows_above bool

Whether the row(s) above the selected row should be removed from the DataFrame.

False
reset_index bool

Whether the index should be reset on the returning DataFrame.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with set column names.

Source code in janitor/functions/row_to_names.py
@pf.register_dataframe_method
@deprecated_alias(row_number="row_numbers", remove_row="remove_rows")
def row_to_names(
    df: pd.DataFrame,
    row_numbers: int = 0,
    remove_rows: bool = False,
    remove_rows_above: bool = False,
    reset_index: bool = False,
) -> pd.DataFrame:
    """Elevates a row, or rows, to be the column names of a DataFrame.

    This method does not mutate the original DataFrame.

    Contains options to remove the elevated row from the DataFrame along with
    removing the rows above the selected row.

    Examples:
        Replace column names with the first row and reset the index.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": ["nums", 6, 9],
        ...     "b": ["chars", "x", "y"],
        ... })
        >>> df
              a      b
        0  nums  chars
        1     6      x
        2     9      y
        >>> df.row_to_names(0, remove_rows=True, reset_index=True)
          nums chars
        0    6     x
        1    9     y
        >>> df.row_to_names([0,1], remove_rows=True, reset_index=True)
          nums chars
             6     x
        0    9     y

        Remove rows above the elevated row and the elevated row itself.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": ["bla1", "nums", 6, 9],
        ...     "b": ["bla2", "chars", "x", "y"],
        ... })
        >>> df
              a      b
        0  bla1   bla2
        1  nums  chars
        2     6      x
        3     9      y
        >>> df.row_to_names(1, remove_rows=True, remove_rows_above=True, reset_index=True)
          nums chars
        0    6     x
        1    9     y

    Args:
        df: A pandas DataFrame.
        row_numbers: Position of the row(s) containing the variable names.
            Note that indexing starts from 0. It can also be a list,
            in which case, a MultiIndex column is created.
            Defaults to 0 (first row).
        remove_rows: Whether the row(s) should be removed from the DataFrame.
        remove_rows_above: Whether the row(s) above the selected row should
            be removed from the DataFrame.
        reset_index: Whether the index should be reset on the returning DataFrame.

    Returns:
        A pandas DataFrame with set column names.
    """  # noqa: E501
    if not pd.options.mode.copy_on_write:
        df = df.copy()

    check("row_number", row_numbers, [int, list])
    if isinstance(row_numbers, list):
        for entry in row_numbers:
            check("entry in the row_number argument", entry, [int])

    warnings.warn(
        "The function row_to_names will, in the official 1.0 release, "
        "change its behaviour to reset the dataframe's index by default. "
        "You can prepare for this change right now by explicitly setting "
        "`reset_index=True` when calling on `row_to_names`."
    )
    # should raise if positional indexers are missing
    # IndexError: positional indexers are out-of-bounds
    headers = df.iloc[row_numbers]
    if isinstance(headers, pd.DataFrame) and (len(headers) == 1):
        headers = headers.squeeze()
    if isinstance(headers, pd.Series):
        headers = pd.Index(headers)
    else:
        headers = [entry.array for _, entry in headers.items()]
        headers = pd.MultiIndex.from_tuples(headers)

    df.columns = headers
    df.columns.name = None

    df_index = df.index
    if remove_rows_above:
        if isinstance(row_numbers, list):
            if not (np.diff(row_numbers) == 1).all():
                raise ValueError(
                    "The remove_rows_above argument is applicable "
                    "only if the row_numbers argument is an integer, "
                    "or the integers in a list are consecutive increasing, "
                    "with a difference of 1."
                )
            tail = row_numbers[0]
        else:
            tail = row_numbers
        df = df.iloc[tail:]
    if remove_rows:
        if isinstance(row_numbers, int):
            row_numbers = [row_numbers]
        df_index = df.index.symmetric_difference(df_index[row_numbers])
        df = df.loc[df_index]
    if reset_index:
        df.index = range(len(df))
    return df
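
For the common single-row case, the elevation can be sketched in a few lines of plain pandas; this is not the library's implementation, just the underlying idea, shown on the example frame:

import pandas as pd

df = pd.DataFrame({"a": ["nums", 6, 9], "b": ["chars", "x", "y"]})

header_row = 0
out = df.copy()
out.columns = pd.Index(out.iloc[header_row])  # elevate the row to column names
out.columns.name = None
# drop the header row and anything above it, then reset the index
out = out.iloc[header_row + 1:].reset_index(drop=True)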

select

select(df, *args, index=None, columns=None, axis='columns', invert=False)

Method-chainable selection of rows and columns.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

Selection can be inverted with the DropLabel class.

Optional ability to invert selection of index/columns available as well.

New in version 0.24.0

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods, as they are generally performant. select is primarily for convenience.

Version Changed

  • 0.26.0
    • Added variable args, invert and axis parameters.
    • rows keyword deprecated in favour of index.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
>>> df.select(index='cobra', columns='shield')
       shield
cobra       2

Labels can be dropped with the DropLabel class:

>>> df.select(index=DropLabel('cobra'))
            max_speed  shield
viper               4       5
sidewinder          7       8

More examples can be found in the select_columns section.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*args

Valid inputs include: an exact index name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
index Any

Valid inputs include: an exact label to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

None
columns Any

Valid inputs include: an exact label to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

None
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the rows/columns provided.

False
axis str

Whether the selection should be on the index('index'), or columns('columns'). Applicable only for the variable args parameter.

'columns'

Raises:

Type Description
ValueError

If args and index/columns are provided.

Returns:

Type Description
DataFrame

A pandas DataFrame with the specified rows and/or columns selected.

Source code in janitor/functions/select.py
@pf.register_dataframe_method
@deprecated_alias(rows="index")
def select(
    df: pd.DataFrame,
    *args,
    index: Any = None,
    columns: Any = None,
    axis: str = "columns",
    invert: bool = False,
) -> pd.DataFrame:
    """Method-chainable selection of rows and columns.

    It accepts a string, shell-like glob strings `(*string*)`,
    regex, slice, array-like object, or a list of the previous options.

    Selection on a MultiIndex on a level, or multiple levels,
    is possible with a dictionary.

    This method does not mutate the original DataFrame.

    Selection can be inverted with the `DropLabel` class.

    Optional ability to invert selection of index/columns available as well.


    !!! info "New in version 0.24.0"


    !!!note

        The preferred option when selecting columns or rows in a Pandas DataFrame
        is with `.loc` or `.iloc` methods, as they are generally performant.
        `select` is primarily for convenience.

    !!! abstract "Version Changed"

        - 0.26.0
            - Added variable `args`, `invert` and `axis` parameters.
            - `rows` keyword deprecated in favour of `index`.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
        ...      index=['cobra', 'viper', 'sidewinder'],
        ...      columns=['max_speed', 'shield'])
        >>> df
                    max_speed  shield
        cobra               1       2
        viper               4       5
        sidewinder          7       8
        >>> df.select(index='cobra', columns='shield')
               shield
        cobra       2

        Labels can be dropped with the `DropLabel` class:

        >>> df.select(index=DropLabel('cobra'))
                    max_speed  shield
        viper               4       5
        sidewinder          7       8

    More examples can be found in the
    [`select_columns`][janitor.functions.select.select_columns] section.

    Args:
        df: A pandas DataFrame.
        *args: Valid inputs include: an exact index name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        index: Valid inputs include: an exact label to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        columns: Valid inputs include: an exact label to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        invert: Whether or not to invert the selection.
            This will result in the selection
            of the complement of the rows/columns provided.
        axis: Whether the selection should be on the index('index'),
            or columns('columns').
            Applicable only for the variable args parameter.

    Raises:
        ValueError: If args and index/columns are provided.

    Returns:
        A pandas DataFrame with the specified rows and/or columns selected.
    """  # noqa: E501

    if args:
        check("invert", invert, [bool])
        if (index is not None) or (columns is not None):
            raise ValueError(
                "Either provide variable args with the axis parameter, "
                "or provide arguments to the index and/or columns parameters."
            )
        if axis == "index":
            return _select(df, rows=list(args), columns=columns, invert=invert)
        if axis == "columns":
            return _select(df, columns=list(args), rows=index, invert=invert)
        raise ValueError("axis should be either 'index' or 'columns'.")
    return _select(df, rows=index, columns=columns, invert=invert)
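
A short sketch of the variable-args path on the snake frame above, assuming the 0.26.0+ behaviour where positional labels follow the `axis` parameter:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

shields = df.select("shield")                 # positional labels act on columns by default
vipers = df.select("viper", axis="index")     # ...or on the index when axis="index"
fast_only = df.select("shield", invert=True)  # invert keeps the complement (max_speed)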

select_columns(df, *args, invert=False)

Method-chainable selection of columns.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

Optional ability to invert selection of columns available as well.

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods. select_columns is primarily for convenience.

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> from numpy import nan
>>> pd.set_option("display.max_columns", None)
>>> pd.set_option("display.expand_frame_repr", False)
>>> pd.set_option("max_colwidth", None)
>>> data = {'name': ['Cheetah','Owl monkey','Mountain beaver',
...                  'Greater short-tailed shrew','Cow'],
...         'genus': ['Acinonyx', 'Aotus', 'Aplodontia', 'Blarina', 'Bos'],
...         'vore': ['carni', 'omni', 'herbi', 'omni', 'herbi'],
...         'order': ['Carnivora','Primates','Rodentia','Soricomorpha','Artiodactyla'],
...         'conservation': ['lc', nan, 'nt', 'lc', 'domesticated'],
...         'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
...         'sleep_rem': [nan, 1.8, 2.4, 2.3, 0.7],
...         'sleep_cycle': [nan, nan, nan, 0.133333333, 0.666666667],
...         'awake': [11.9, 7.0, 9.6, 9.1, 20.0],
...         'brainwt': [nan, 0.0155, nan, 0.00029, 0.423],
...         'bodywt': [50.0, 0.48, 1.35, 0.019, 600.0]}
>>> df = pd.DataFrame(data)
>>> df
                         name       genus   vore         order  conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
0                     Cheetah    Acinonyx  carni     Carnivora            lc         12.1        NaN          NaN   11.9      NaN   50.000
1                  Owl monkey       Aotus   omni      Primates           NaN         17.0        1.8          NaN    7.0  0.01550    0.480
2             Mountain beaver  Aplodontia  herbi      Rodentia            nt         14.4        2.4          NaN    9.6      NaN    1.350
3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc         14.9        2.3     0.133333    9.1  0.00029    0.019
4                         Cow         Bos  herbi  Artiodactyla  domesticated          4.0        0.7     0.666667   20.0  0.42300  600.000

Explicit label selection:

>>> df.select_columns('name', 'order')
                         name         order
0                     Cheetah     Carnivora
1                  Owl monkey      Primates
2             Mountain beaver      Rodentia
3  Greater short-tailed shrew  Soricomorpha
4                         Cow  Artiodactyla

Selection via globbing:

>>> df.select_columns("sleep*", "*wt")
   sleep_total  sleep_rem  sleep_cycle  brainwt   bodywt
0         12.1        NaN          NaN      NaN   50.000
1         17.0        1.8          NaN  0.01550    0.480
2         14.4        2.4          NaN      NaN    1.350
3         14.9        2.3     0.133333  0.00029    0.019
4          4.0        0.7     0.666667  0.42300  600.000

Selection via regex:

>>> import re
>>> df.select_columns(re.compile(r"o.+er"))
          order  conservation
0     Carnivora            lc
1      Primates           NaN
2      Rodentia            nt
3  Soricomorpha            lc
4  Artiodactyla  domesticated

Selection via slicing:

>>> df.select_columns(slice('name','order'), slice('sleep_total','sleep_cycle'))
                         name       genus   vore         order  sleep_total  sleep_rem  sleep_cycle
0                     Cheetah    Acinonyx  carni     Carnivora         12.1        NaN          NaN
1                  Owl monkey       Aotus   omni      Primates         17.0        1.8          NaN
2             Mountain beaver  Aplodontia  herbi      Rodentia         14.4        2.4          NaN
3  Greater short-tailed shrew     Blarina   omni  Soricomorpha         14.9        2.3     0.133333
4                         Cow         Bos  herbi  Artiodactyla          4.0        0.7     0.666667

Selection via callable:

>>> from pandas.api.types import is_numeric_dtype
>>> df.select_columns(is_numeric_dtype)
   sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
0         12.1        NaN          NaN   11.9      NaN   50.000
1         17.0        1.8          NaN    7.0  0.01550    0.480
2         14.4        2.4          NaN    9.6      NaN    1.350
3         14.9        2.3     0.133333    9.1  0.00029    0.019
4          4.0        0.7     0.666667   20.0  0.42300  600.000
>>> df.select_columns(lambda f: f.isna().any())
   conservation  sleep_rem  sleep_cycle  brainwt
0            lc        NaN          NaN      NaN
1           NaN        1.8          NaN  0.01550
2            nt        2.4          NaN      NaN
3            lc        2.3     0.133333  0.00029
4  domesticated        0.7     0.666667  0.42300

Exclude columns with the invert parameter:

>>> df.select_columns(is_numeric_dtype, invert=True)
                         name       genus   vore         order  conservation
0                     Cheetah    Acinonyx  carni     Carnivora            lc
1                  Owl monkey       Aotus   omni      Primates           NaN
2             Mountain beaver  Aplodontia  herbi      Rodentia            nt
3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc
4                         Cow         Bos  herbi  Artiodactyla  domesticated

Exclude columns with the DropLabel class:

>>> from janitor import DropLabel
>>> df.select_columns(DropLabel(slice("name", "awake")), "conservation")
   brainwt   bodywt  conservation
0      NaN   50.000            lc
1  0.01550    0.480           NaN
2      NaN    1.350            nt
3  0.00029    0.019            lc
4  0.42300  600.000  domesticated

Selection on MultiIndex columns:

>>> d = {'num_legs': [4, 4, 2, 2],
...      'num_wings': [0, 0, 2, 2],
...      'class': ['mammal', 'mammal', 'mammal', 'bird'],
...      'animal': ['cat', 'dog', 'bat', 'penguin'],
...      'locomotion': ['walks', 'walks', 'flies', 'walks']}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(['class', 'animal', 'locomotion']).T
>>> df
class      mammal                bird
animal        cat   dog   bat penguin
locomotion  walks walks flies   walks
num_legs        4     4     2       2
num_wings       0     0     2       2

Selection with a scalar:

>>> df.select_columns('mammal')
class      mammal
animal        cat   dog   bat
locomotion  walks walks flies
num_legs        4     4     2
num_wings       0     0     2

Selection with a tuple:

>>> df.select_columns(('mammal','bat'))
class      mammal
animal        bat
locomotion  flies
num_legs        2
num_wings       2

Selection within a level is possible with a dictionary, where the key is either a level name or number:

>>> df.select_columns({'animal':'cat'})
class      mammal
animal        cat
locomotion  walks
num_legs        4
num_wings       0
>>> df.select_columns({1:["bat", "cat"]})
class      mammal
animal        bat   cat
locomotion  flies walks
num_legs        2     4
num_wings       2     0

Selection on multiple levels:

>>> df.select_columns({"class":"mammal", "locomotion":"flies"})
class      mammal
animal        bat
locomotion  flies
num_legs        2
num_wings       2

Selection with a regex on a level:

>>> df.select_columns({"animal":re.compile(".+t$")})
class      mammal
animal        cat   bat
locomotion  walks flies
num_legs        4     2
num_wings       0     2

Selection with a callable on a level:

>>> df.select_columns({"animal":lambda f: f.str.endswith('t')})
class      mammal
animal        cat   bat
locomotion  walks flies
num_legs        4     2
num_wings       0     2

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*args Any

Valid inputs include: an exact column name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the columns provided.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with the specified columns selected.

Source code in janitor/functions/select.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.select` instead."
    )
)
def select_columns(
    df: pd.DataFrame,
    *args: Any,
    invert: bool = False,
) -> pd.DataFrame:
    """Method-chainable selection of columns.

    It accepts a string, shell-like glob strings `(*string*)`,
    regex, slice, array-like object, or a list of the previous options.

    Selection on a MultiIndex on a level, or multiple levels,
    is possible with a dictionary.

    This method does not mutate the original DataFrame.

    Optional ability to invert selection of columns available as well.

    !!!note

        The preferred option when selecting columns or rows in a Pandas DataFrame
        is with `.loc` or `.iloc` methods.
        `select_columns` is primarily for convenience.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `jn.select` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> from numpy import nan
        >>> pd.set_option("display.max_columns", None)
        >>> pd.set_option("display.expand_frame_repr", False)
        >>> pd.set_option("max_colwidth", None)
        >>> data = {'name': ['Cheetah','Owl monkey','Mountain beaver',
        ...                  'Greater short-tailed shrew','Cow'],
        ...         'genus': ['Acinonyx', 'Aotus', 'Aplodontia', 'Blarina', 'Bos'],
        ...         'vore': ['carni', 'omni', 'herbi', 'omni', 'herbi'],
        ...         'order': ['Carnivora','Primates','Rodentia','Soricomorpha','Artiodactyla'],
        ...         'conservation': ['lc', nan, 'nt', 'lc', 'domesticated'],
        ...         'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
        ...         'sleep_rem': [nan, 1.8, 2.4, 2.3, 0.7],
        ...         'sleep_cycle': [nan, nan, nan, 0.133333333, 0.666666667],
        ...         'awake': [11.9, 7.0, 9.6, 9.1, 20.0],
        ...         'brainwt': [nan, 0.0155, nan, 0.00029, 0.423],
        ...         'bodywt': [50.0, 0.48, 1.35, 0.019, 600.0]}
        >>> df = pd.DataFrame(data)
        >>> df
                                 name       genus   vore         order  conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
        0                     Cheetah    Acinonyx  carni     Carnivora            lc         12.1        NaN          NaN   11.9      NaN   50.000
        1                  Owl monkey       Aotus   omni      Primates           NaN         17.0        1.8          NaN    7.0  0.01550    0.480
        2             Mountain beaver  Aplodontia  herbi      Rodentia            nt         14.4        2.4          NaN    9.6      NaN    1.350
        3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc         14.9        2.3     0.133333    9.1  0.00029    0.019
        4                         Cow         Bos  herbi  Artiodactyla  domesticated          4.0        0.7     0.666667   20.0  0.42300  600.000

        Explicit label selection:
        >>> df.select_columns('name', 'order')
                                 name         order
        0                     Cheetah     Carnivora
        1                  Owl monkey      Primates
        2             Mountain beaver      Rodentia
        3  Greater short-tailed shrew  Soricomorpha
        4                         Cow  Artiodactyla

        Selection via globbing:
        >>> df.select_columns("sleep*", "*wt")
           sleep_total  sleep_rem  sleep_cycle  brainwt   bodywt
        0         12.1        NaN          NaN      NaN   50.000
        1         17.0        1.8          NaN  0.01550    0.480
        2         14.4        2.4          NaN      NaN    1.350
        3         14.9        2.3     0.133333  0.00029    0.019
        4          4.0        0.7     0.666667  0.42300  600.000

        Selection via regex:
        >>> import re
        >>> df.select_columns(re.compile(r"o.+er"))
                  order  conservation
        0     Carnivora            lc
        1      Primates           NaN
        2      Rodentia            nt
        3  Soricomorpha            lc
        4  Artiodactyla  domesticated

        Selection via slicing:
        >>> df.select_columns(slice('name','order'), slice('sleep_total','sleep_cycle'))
                                 name       genus   vore         order  sleep_total  sleep_rem  sleep_cycle
        0                     Cheetah    Acinonyx  carni     Carnivora         12.1        NaN          NaN
        1                  Owl monkey       Aotus   omni      Primates         17.0        1.8          NaN
        2             Mountain beaver  Aplodontia  herbi      Rodentia         14.4        2.4          NaN
        3  Greater short-tailed shrew     Blarina   omni  Soricomorpha         14.9        2.3     0.133333
        4                         Cow         Bos  herbi  Artiodactyla          4.0        0.7     0.666667

        Selection via callable:
        >>> from pandas.api.types import is_numeric_dtype
        >>> df.select_columns(is_numeric_dtype)
           sleep_total  sleep_rem  sleep_cycle  awake  brainwt   bodywt
        0         12.1        NaN          NaN   11.9      NaN   50.000
        1         17.0        1.8          NaN    7.0  0.01550    0.480
        2         14.4        2.4          NaN    9.6      NaN    1.350
        3         14.9        2.3     0.133333    9.1  0.00029    0.019
        4          4.0        0.7     0.666667   20.0  0.42300  600.000
        >>> df.select_columns(lambda f: f.isna().any())
           conservation  sleep_rem  sleep_cycle  brainwt
        0            lc        NaN          NaN      NaN
        1           NaN        1.8          NaN  0.01550
        2            nt        2.4          NaN      NaN
        3            lc        2.3     0.133333  0.00029
        4  domesticated        0.7     0.666667  0.42300

        Exclude columns with the `invert` parameter:
        >>> df.select_columns(is_numeric_dtype, invert=True)
                                 name       genus   vore         order  conservation
        0                     Cheetah    Acinonyx  carni     Carnivora            lc
        1                  Owl monkey       Aotus   omni      Primates           NaN
        2             Mountain beaver  Aplodontia  herbi      Rodentia            nt
        3  Greater short-tailed shrew     Blarina   omni  Soricomorpha            lc
        4                         Cow         Bos  herbi  Artiodactyla  domesticated

        Exclude columns with the `DropLabel` class:
        >>> from janitor import DropLabel
        >>> df.select_columns(DropLabel(slice("name", "awake")), "conservation")
           brainwt   bodywt  conservation
        0      NaN   50.000            lc
        1  0.01550    0.480           NaN
        2      NaN    1.350            nt
        3  0.00029    0.019            lc
        4  0.42300  600.000  domesticated

        Selection on MultiIndex columns:
        >>> d = {'num_legs': [4, 4, 2, 2],
        ...      'num_wings': [0, 0, 2, 2],
        ...      'class': ['mammal', 'mammal', 'mammal', 'bird'],
        ...      'animal': ['cat', 'dog', 'bat', 'penguin'],
        ...      'locomotion': ['walks', 'walks', 'flies', 'walks']}
        >>> df = pd.DataFrame(data=d)
        >>> df = df.set_index(['class', 'animal', 'locomotion']).T
        >>> df
        class      mammal                bird
        animal        cat   dog   bat penguin
        locomotion  walks walks flies   walks
        num_legs        4     4     2       2
        num_wings       0     0     2       2

        Selection with a scalar:
        >>> df.select_columns('mammal')
        class      mammal
        animal        cat   dog   bat
        locomotion  walks walks flies
        num_legs        4     4     2
        num_wings       0     0     2

        Selection with a tuple:
        >>> df.select_columns(('mammal','bat'))
        class      mammal
        animal        bat
        locomotion  flies
        num_legs        2
        num_wings       2

        Selection within a level is possible with a dictionary,
        where the key is either a level name or number:
        >>> df.select_columns({'animal':'cat'})
        class      mammal
        animal        cat
        locomotion  walks
        num_legs        4
        num_wings       0
        >>> df.select_columns({1:["bat", "cat"]})
        class      mammal
        animal        bat   cat
        locomotion  flies walks
        num_legs        2     4
        num_wings       2     0

        Selection on multiple levels:
        >>> df.select_columns({"class":"mammal", "locomotion":"flies"})
        class      mammal
        animal        bat
        locomotion  flies
        num_legs        2
        num_wings       2

        Selection with a regex on a level:
        >>> df.select_columns({"animal":re.compile(".+t$")})
        class      mammal
        animal        cat   bat
        locomotion  walks flies
        num_legs        4     2
        num_wings       0     2

        Selection with a callable on a level:
        >>> df.select_columns({"animal":lambda f: f.str.endswith('t')})
        class      mammal
        animal        cat   bat
        locomotion  walks flies
        num_legs        4     2
        num_wings       0     2

    Args:
        df: A pandas DataFrame.
        *args: Valid inputs include: an exact column name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        invert: Whether or not to invert the selection.
            This will result in the selection
            of the complement of the columns provided.

    Returns:
        A pandas DataFrame with the specified columns selected.
    """  # noqa: E501

    return _select(df, columns=list(args), invert=invert)
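
Since `select_columns` is being folded into `select`, the glob examples translate directly; a small self-contained sketch (a trimmed-down frame, not the full mammal dataset above):

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame(
    {"name": ["Cheetah"], "sleep_total": [12.1], "sleep_rem": [0.5], "bodywt": [50.0]}
)

# df.select_columns("sleep*", "*wt") becomes either of:
out = df.select("sleep*", "*wt")             # columns is the default axis
same = df.select(columns=["sleep*", "*wt"])  # or a list passed to the columns keyword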

select_rows(df, *args, invert=False)

Method-chainable selection of rows.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

Optional ability to invert selection of rows available as well.

New in version 0.24.0

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods, as they are generally performant. select_rows is primarily for convenience.

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}
>>> df = pd.DataFrame.from_dict(df, orient='index')
>>> df
      0  1
col1  1  2
foo   3  4
col2  5  6
>>> df.select_rows("col*")
      0  1
col1  1  2
col2  5  6

More examples can be found in the select_columns section.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*args Any

Valid inputs include: an exact index name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the rows provided.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with the specified rows selected.

Source code in janitor/functions/select.py
@pf.register_dataframe_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.select` instead."
    )
)
def select_rows(
    df: pd.DataFrame,
    *args: Any,
    invert: bool = False,
) -> pd.DataFrame:
    """Method-chainable selection of rows.

    It accepts a string, shell-like glob strings `(*string*)`,
    regex, slice, array-like object, or a list of the previous options.

    Selection on a MultiIndex on a level, or multiple levels,
    is possible with a dictionary.

    This method does not mutate the original DataFrame.

    Optional ability to invert selection of rows available as well.


    !!! info "New in version 0.24.0"

    !!!note

        The preferred option when selecting columns or rows in a Pandas DataFrame
        is with `.loc` or `.iloc` methods, as they are generally performant.
        `select_rows` is primarily for convenience.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `jn.select` instead.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}
        >>> df = pd.DataFrame.from_dict(df, orient='index')
        >>> df
              0  1
        col1  1  2
        foo   3  4
        col2  5  6
        >>> df.select_rows("col*")
              0  1
        col1  1  2
        col2  5  6

    More examples can be found in the
    [`select_columns`][janitor.functions.select.select_columns] section.

    Args:
        df: A pandas DataFrame.
        *args: Valid inputs include: an exact index name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        invert: Whether or not to invert the selection.
            This will result in the selection
            of the complement of the rows provided.

    Returns:
        A pandas DataFrame with the specified rows selected.
    """  # noqa: E501
    return _select(df, rows=list(args), invert=invert)

shuffle

Implementation of shuffle functions.

shuffle(df, random_state=None, reset_index=True)

Shuffle the rows of the DataFrame.

This method does not mutate the original DataFrame.

Super-sugary syntax! Underneath the hood, we use df.sample(frac=1), with the option to set the random state.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "col1": range(5),
...     "col2": list("abcde"),
... })
>>> df
   col1 col2
0     0    a
1     1    b
2     2    c
3     3    d
4     4    e
>>> df.shuffle(random_state=42)
   col1 col2
0     1    b
1     4    e
2     2    c
3     0    a
4     3    d
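
With `reset_index=False`, the shuffled frame keeps its original index labels instead of a fresh RangeIndex; a small sketch reusing the frame and seed from the example above:

```python
df.shuffle(random_state=42, reset_index=False)
#    col1 col2
# 1     1    b
# 4     4    e
# 2     2    c
# 0     0    a
# 3     3    d
```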

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
random_state Any

If provided, set a seed for the random number generator. Passed to pd.DataFrame.sample().

None
reset_index bool

If True, reset the dataframe index to the default RangeIndex.

True

Returns:

Type Description
DataFrame

A shuffled pandas DataFrame.

Source code in janitor/functions/shuffle.py
Lines 9-55
@pf.register_dataframe_method
def shuffle(
    df: pd.DataFrame, random_state: Any = None, reset_index: bool = True
) -> pd.DataFrame:
    """Shuffle the rows of the DataFrame.

    This method does not mutate the original DataFrame.

    Super-sugary syntax! Underneath the hood, we use `df.sample(frac=1)`,
    with the option to set the random state.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "col1": range(5),
        ...     "col2": list("abcde"),
        ... })
        >>> df
           col1 col2
        0     0    a
        1     1    b
        2     2    c
        3     3    d
        4     4    e
        >>> df.shuffle(random_state=42)
           col1 col2
        0     1    b
        1     4    e
        2     2    c
        3     0    a
        4     3    d

    Args:
        df: A pandas DataFrame.
        random_state: If provided, set a seed for the random number
            generator. Passed to `pd.DataFrame.sample()`.
        reset_index: If True, reset the dataframe index to the default
            RangeIndex.

    Returns:
        A shuffled pandas DataFrame.
    """
    result = df.sample(frac=1, random_state=random_state)
    if reset_index:
        result = result.reset_index(drop=True)
    return result

sort_column_value_order

Implementation of the sort_column_value_order function.

sort_column_value_order(df, column, column_value_order, columns=None)

This function adds precedence to certain values in a specified column, then sorts based on that column and any other specified columns.

Examples:

>>> import pandas as pd
>>> import janitor
>>> import numpy as np
>>> company_sales = {
...     "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"],
...     "Company1": [150.0, 200.0, 200.0, 300.0, 400.0],
...     "Company2": [180.0, 250.0, 250.0, np.nan, 500.0],
...     "Company3": [400.0, 500.0, 500.0, 600.0, 675.0],
... }
>>> df = pd.DataFrame.from_dict(company_sales)
>>> df
  SalesMonth  Company1  Company2  Company3
0        Jan     150.0     180.0     400.0
1        Feb     200.0     250.0     500.0
2        Feb     200.0     250.0     500.0
3        Mar     300.0       NaN     600.0
4      April     400.0     500.0     675.0
>>> df.sort_column_value_order(
...     "SalesMonth",
...     {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}
... )
  SalesMonth  Company1  Company2  Company3
4      April     400.0     500.0     675.0
3        Mar     300.0       NaN     600.0
1        Feb     200.0     250.0     500.0
2        Feb     200.0     250.0     500.0
0        Jan     150.0     180.0     400.0
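
The optional `columns` argument adds secondary sort keys after the precedence ordering; a sketch reusing the frame and ordering dictionary above:

```python
df.sort_column_value_order(
    "SalesMonth",
    {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4},
    columns=["Company1"],
)
# Rows are ordered by month precedence first,
# then by Company1 within each month.
```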

Parameters:

Name Type Description Default
df DataFrame

pandas DataFrame that we are manipulating

required
column str

This is a column name as a string we are using to specify which column to sort by

required
column_value_order dict

Dictionary of values that will represent precedence of the values in the specified column

required
columns

A list of additional columns that we can sort by

None

Raises:

Type Description
ValueError

If chosen Column Name is not in Dataframe, or if column_value_order dictionary is empty.

Returns:

Type Description
DataFrame

A sorted pandas DataFrame.

Source code in janitor/functions/sort_column_value_order.py
Lines 9-73
@pf.register_dataframe_method
def sort_column_value_order(
    df: pd.DataFrame, column: str, column_value_order: dict, columns=None
) -> pd.DataFrame:
    """This function adds precedence to certain values in a specified column,
    then sorts based on that column and any other specified columns.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> import numpy as np
        >>> company_sales = {
        ...     "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"],
        ...     "Company1": [150.0, 200.0, 200.0, 300.0, 400.0],
        ...     "Company2": [180.0, 250.0, 250.0, np.nan, 500.0],
        ...     "Company3": [400.0, 500.0, 500.0, 600.0, 675.0],
        ... }
        >>> df = pd.DataFrame.from_dict(company_sales)
        >>> df
          SalesMonth  Company1  Company2  Company3
        0        Jan     150.0     180.0     400.0
        1        Feb     200.0     250.0     500.0
        2        Feb     200.0     250.0     500.0
        3        Mar     300.0       NaN     600.0
        4      April     400.0     500.0     675.0
        >>> df.sort_column_value_order(
        ...     "SalesMonth",
        ...     {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}
        ... )
          SalesMonth  Company1  Company2  Company3
        4      April     400.0     500.0     675.0
        3        Mar     300.0       NaN     600.0
        1        Feb     200.0     250.0     500.0
        2        Feb     200.0     250.0     500.0
        0        Jan     150.0     180.0     400.0

    Args:
        df: pandas DataFrame that we are manipulating
        column: This is a column name as a string we are using to specify
            which column to sort by
        column_value_order: Dictionary of values that will
            represent precedence of the values in the specified column
        columns: A list of additional columns that we can sort by

    Raises:
        ValueError: If chosen Column Name is not in
            Dataframe, or if `column_value_order` dictionary is empty.

    Returns:
        A sorted pandas DataFrame.
    """
    # Validation checks
    check_column(df, column, present=True)
    check("column_value_order", column_value_order, [dict])
    if not column_value_order:
        raise ValueError("column_value_order dictionary cannot be empty")

    df = df.assign(cond_order=df[column].map(column_value_order))

    sort_by = ["cond_order"]
    if columns is not None:
        sort_by = ["cond_order"] + columns

    df = df.sort_values(sort_by).remove_columns("cond_order")
    return df

sort_naturally

Implementation of the sort_naturally function.

sort_naturally(df, column_name, **natsorted_kwargs)

Sort a DataFrame by a column using natural sorting.

Natural sorting is distinct from the default lexicographical sorting provided by pandas. For example, given the following list of items:

["A1", "A11", "A3", "A2", "A10"]

Lexicographical sorting would give us:

["A1", "A10", "A11", "A2", "A3"]

By contrast, "natural" sorting would give us:

["A1", "A2", "A3", "A10", "A11"]

This function thus provides natural sorting on a single column of a dataframe.

To accomplish this, we do a natural sort on the unique values that are present in the dataframe. Then, we reconstitute the entire dataframe in the naturally sorted order.

Natural sorting is provided by the Python package natsort.

All keyword arguments to natsort should be provided after the column name to sort by is provided. They are passed through to the natsorted function.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "Well": ["A21", "A3", "A21", "B2", "B51", "B12"],
...         "Value": [1, 2, 13, 3, 4, 7],
...     }
... )
>>> df
  Well  Value
0  A21      1
1   A3      2
2  A21     13
3   B2      3
4  B51      4
5  B12      7
>>> df.sort_naturally("Well")
  Well  Value
1   A3      2
0  A21      1
2  A21     13
3   B2      3
5  B12      7
4  B51      4
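
Keyword arguments are forwarded to natsort's `index_natsorted`, so a descending natural sort is a sketch like this (reusing the frame above):

```python
df.sort_naturally("Well", reverse=True)
# "Well" now runs B51, B12, B2, A21, A21, A3 from top to bottom.
```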

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name str

The column on which natural sorting should take place.

required
**natsorted_kwargs Any

Keyword arguments to be passed to natsort's natsorted function.

{}

Returns:

Type Description
DataFrame

A sorted pandas DataFrame.

Source code in janitor/functions/sort_naturally.py
Lines 10-87
@pf.register_dataframe_method
def sort_naturally(
    df: pd.DataFrame, column_name: str, **natsorted_kwargs: Any
) -> pd.DataFrame:
    """Sort a DataFrame by a column using *natural* sorting.

    Natural sorting is distinct from
    the default lexicographical sorting provided by `pandas`.
    For example, given the following list of items:

    ```python
    ["A1", "A11", "A3", "A2", "A10"]
    ```

    Lexicographical sorting would give us:

    ```python
    ["A1", "A10", "A11", "A2", "A3"]
    ```

    By contrast, "natural" sorting would give us:

    ```python
    ["A1", "A2", "A3", "A10", "A11"]
    ```

    This function thus provides *natural* sorting
    on a single column of a dataframe.

    To accomplish this, we do a natural sort
    on the unique values that are present in the dataframe.
    Then, we reconstitute the entire dataframe
    in the naturally sorted order.

    Natural sorting is provided by the Python package
    [natsort](https://natsort.readthedocs.io/en/master/index.html).

    All keyword arguments to `natsort` should be provided
    after the column name to sort by is provided.
    They are passed through to the `natsorted` function.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "Well": ["A21", "A3", "A21", "B2", "B51", "B12"],
        ...         "Value": [1, 2, 13, 3, 4, 7],
        ...     }
        ... )
        >>> df
          Well  Value
        0  A21      1
        1   A3      2
        2  A21     13
        3   B2      3
        4  B51      4
        5  B12      7
        >>> df.sort_naturally("Well")
          Well  Value
        1   A3      2
        0  A21      1
        2  A21     13
        3   B2      3
        5  B12      7
        4  B51      4

    Args:
        df: A pandas DataFrame.
        column_name: The column on which natural sorting should take place.
        **natsorted_kwargs: Keyword arguments to be passed
            to natsort's `natsorted` function.

    Returns:
        A sorted pandas DataFrame.
    """
    new_order = index_natsorted(df[column_name], **natsorted_kwargs)
    return df.iloc[new_order, :]

take_first

Implementation of take_first function.

take_first(df, subset, by, ascending=True)

Take the first row within each group specified by subset.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]})
>>> df
   a  b
0  x  0
1  x  1
2  y  2
3  y  3
>>> df.take_first(subset="a", by="b")
   a  b
0  x  0
2  y  2
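
With `ascending=False`, the row with the largest `by` value in each group is kept instead; a sketch on the same frame:

```python
df.take_first(subset="a", by="b", ascending=False)
#    a  b
# 3  y  3
# 1  x  1
```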

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
subset Union[Hashable, Iterable[Hashable]]

Column(s) defining the group.

required
by Hashable

Column to sort by.

required
ascending bool

Whether or not to sort in ascending order, bool.

True

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/take_first.py
Lines 9-46
@pf.register_dataframe_method
def take_first(
    df: pd.DataFrame,
    subset: Union[Hashable, Iterable[Hashable]],
    by: Hashable,
    ascending: bool = True,
) -> pd.DataFrame:
    """Take the first row within each group specified by `subset`.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]})
        >>> df
           a  b
        0  x  0
        1  x  1
        2  y  2
        3  y  3
        >>> df.take_first(subset="a", by="b")
           a  b
        0  x  0
        2  y  2

    Args:
        df: A pandas DataFrame.
        subset: Column(s) defining the group.
        by: Column to sort by.
        ascending: Whether or not to sort in ascending order, `bool`.

    Returns:
        A pandas DataFrame.
    """
    result = df.sort_values(by=by, ascending=ascending).drop_duplicates(
        subset=subset, keep="first"
    )

    return result

then

Implementation source for then.

then(df, func)

Add an arbitrary function to run in the pyjanitor method chain.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.pipe instead.

Examples:

A trivial example using a lambda func.

>>> import pandas as pd
>>> import janitor
>>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
...  .then(lambda df: df * 2))
   a   b
0  2  14
1  4  16
2  6  18
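
Since `then` is slated for deprecation, the equivalent spelling with pandas' built-in `pipe` (per the note above) is:

```python
import pandas as pd

(pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
 .pipe(lambda df: df * 2))
#    a   b
# 0  2  14
# 1  4  16
# 2  6  18
```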

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
func Callable

A function you would like to run in the method chain. It should take one parameter and return one parameter, each being the DataFrame object. After that, do whatever you want in the middle. Go crazy.

required

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/then.py
Lines 11-49
@pf.register_dataframe_method
@refactored_function(
    message="This function will be deprecated in a 1.x release. "
    "Kindly use `pd.DataFrame.pipe` instead."
)
def then(df: pd.DataFrame, func: Callable) -> pd.DataFrame:
    """Add an arbitrary function to run in the `pyjanitor` method chain.

    This method does not mutate the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `pd.DataFrame.pipe` instead.

    Examples:
        A trivial example using a lambda `func`.

        >>> import pandas as pd
        >>> import janitor
        >>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
        ...  .then(lambda df: df * 2))
           a   b
        0  2  14
        1  4  16
        2  6  18

    Args:
        df: A pandas DataFrame.
        func: A function you would like to run in the method chain.
            It should take one parameter and return one parameter, each being
            the DataFrame object. After that, do whatever you want in the
            middle. Go crazy.

    Returns:
        A pandas DataFrame.
    """
    df = func(df)
    return df

to_datetime

Implementation source for to_datetime.

to_datetime(df, column_name, **kwargs)

Convert column to a datetime type, in-place.

Intended to be the method-chaining equivalent of:

df[column_name] = pd.to_datetime(df[column_name], **kwargs)

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

Converting a string column to datetime type with custom format.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'date': ['20200101', '20200202', '20200303']})
>>> df
       date
0  20200101
1  20200202
2  20200303
>>> df.to_datetime('date', format='%Y%m%d')
        date
0 2020-01-01
1 2020-02-02
2 2020-03-03

Read the pandas documentation for to_datetime for more information.
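
A non-deprecated way to express the same conversion, per the note above, is a sketch along these lines, starting again from the string column and using `transform_column` with a column-wise function (unlike `to_datetime`, this returns a new frame rather than mutating `df`):

```python
df.transform_column(
    "date",
    lambda s: pd.to_datetime(s, format="%Y%m%d"),
    elementwise=False,
)
```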

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Column name.

required
**kwargs Any

Provide any kwargs that pd.to_datetime can take.

{}

Returns:

Type Description
DataFrame

A pandas DataFrame with updated datetime data.

Source code in janitor/functions/to_datetime.py
Lines 11-69
@pf.register_dataframe_method
@deprecated_alias(column="column_name")
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `jn.transform_columns` instead."
    )
)
def to_datetime(
    df: pd.DataFrame, column_name: Hashable, **kwargs: Any
) -> pd.DataFrame:
    """Convert column to a datetime type, in-place.

    Intended to be the method-chaining equivalent of:

    ```python
    df[column_name] = pd.to_datetime(df[column_name], **kwargs)
    ```

    This method mutates the original DataFrame.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use [`jn.transform_column`][janitor.functions.transform_columns.transform_column]
        instead.

    Examples:
        Converting a string column to datetime type with custom format.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({'date': ['20200101', '20200202', '20200303']})
        >>> df
               date
        0  20200101
        1  20200202
        2  20200303
        >>> df.to_datetime('date', format='%Y%m%d')
                date
        0 2020-01-01
        1 2020-02-02
        2 2020-03-03

    Read the pandas documentation for [`to_datetime`][pd_docs] for more information.

    [pd_docs]: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

    Args:
        df: A pandas DataFrame.
        column_name: Column name.
        **kwargs: Provide any kwargs that `pd.to_datetime` can take.

    Returns:
        A pandas DataFrame with updated datetime data.
    """  # noqa: E501
    df[column_name] = pd.to_datetime(df[column_name], **kwargs)

    return df

toset

Implementation of the toset function.

toset(series)

Return a set of the values.

Note

This function will be deprecated in a 1.x release. Please use set(df[column]) instead.

The values in the returned set are each a scalar type: a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period).

Examples:

>>> import pandas as pd
>>> import janitor
>>> s = pd.Series([1, 2, 3, 5, 5], index=["a", "b", "c", "d", "e"])
>>> s
a    1
b    2
c    3
d    5
e    5
dtype: int64
>>> s.toset()
{1, 2, 3, 5}
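
The replacement suggested by the deprecation note is simply the built-in `set`:

```python
set(s)  # {1, 2, 3, 5}
```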

Parameters:

Name Type Description Default
series Series

A pandas series.

required

Returns:

Type Description
Set

A set of values.

Source code in janitor/functions/toset.py
Lines 11-51
@pf.register_series_method
@refactored_function(
    message=(
        "This function will be deprecated in a 1.x release. "
        "Please use `set(df[column])` instead."
    )
)
def toset(series: pd.Series) -> Set:
    """Return a set of the values.

    !!!note

        This function will be deprecated in a 1.x release.
        Please use `set(df[column])` instead.

    These are each a scalar type, which is a Python scalar
    (for str, int, float) or a pandas scalar
    (for Timestamp/Timedelta/Interval/Period)

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> s = pd.Series([1, 2, 3, 5, 5], index=["a", "b", "c", "d", "e"])
        >>> s
        a    1
        b    2
        c    3
        d    5
        e    5
        dtype: int64
        >>> s.toset()
        {1, 2, 3, 5}

    Args:
        series: A pandas series.

    Returns:
        A set of values.
    """

    return set(series.tolist())

transform_columns

transform_column(df, column_name, function, dest_column_name=None, elementwise=True)

Transform the given column using the provided function.

Meant to be the method-chaining equivalent of:

df[dest_column_name] = df[column_name].apply(function)

Functions can be applied in one of two ways:

  • Element-wise (default; elementwise=True). Then, the individual column elements will be passed in as the first argument of function.
  • Column-wise (elementwise=False). Then, function is expected to take in a pandas Series and return a sequence that is of identical length to the original.

If dest_column_name is provided, then the transformation result is stored in that column. Otherwise, the transformed result is stored under the name of the original column.

This method does not mutate the original DataFrame.

Examples:

Transform a column in-place with an element-wise function.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "a": [2, 3, 4],
...     "b": ["area", "pyjanitor", "grapefruit"],
... })
>>> df
   a           b
0  2        area
1  3   pyjanitor
2  4  grapefruit
>>> df.transform_column(
...     column_name="a",
...     function=lambda x: x**2 - 1,
... )
    a           b
0   3        area
1   8   pyjanitor
2  15  grapefruit

Examples:

Transform a column in-place with a column-wise function.

>>> df.transform_column(
...     column_name="b",
...     function=lambda srs: srs.str[:5],
...     elementwise=False,
... )
   a      b
0  2   area
1  3  pyjan
2  4  grape
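
`dest_column_name` stores the result alongside the original column instead of overwriting it; a sketch on the same frame (the new column name here is purely illustrative):

```python
df.transform_column(
    column_name="a",
    function=lambda x: x**2 - 1,
    dest_column_name="a_transformed",
)
#    a           b  a_transformed
# 0  2        area              3
# 1  3   pyjanitor              8
# 2  4  grapefruit             15
```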

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to transform.

required
function Callable

A function to apply on the column.

required
dest_column_name Optional[str]

The column name to store the transformation result in. Defaults to None, which will result in the original column name being overwritten. If a name is provided here, then a new column with the transformed values will be created.

None
elementwise bool

Whether to apply the function elementwise or not. If elementwise is True, then the function's first argument should be the data type of each datum in the column of data, and should return a transformed datum. If elementwise is False, then the function should expect a pandas Series passed into it, and return a pandas Series.

True

Returns:

Type Description
DataFrame

A pandas DataFrame with a transformed column.

Source code in janitor/functions/transform_columns.py
Lines 20-122
@pf.register_dataframe_method
@deprecated_alias(col_name="column_name", dest_col_name="dest_column_name")
def transform_column(
    df: pd.DataFrame,
    column_name: Hashable,
    function: Callable,
    dest_column_name: Optional[str] = None,
    elementwise: bool = True,
) -> pd.DataFrame:
    """Transform the given column using the provided function.

    Meant to be the method-chaining equivalent of:
    ```python
    df[dest_column_name] = df[column_name].apply(function)
    ```

    Functions can be applied in one of two ways:

    - **Element-wise** (default; `elementwise=True`). Then, the individual
    column elements will be passed in as the first argument of `function`.
    - **Column-wise** (`elementwise=False`). Then, `function` is expected to
    take in a pandas Series and return a sequence that is of identical length
    to the original.

    If `dest_column_name` is provided, then the transformation result is stored
    in that column. Otherwise, the transformed result is stored under the name
    of the original column.

    This method does not mutate the original DataFrame.

    Examples:
        Transform a column in-place with an element-wise function.

        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "a": [2, 3, 4],
        ...     "b": ["area", "pyjanitor", "grapefruit"],
        ... })
        >>> df
           a           b
        0  2        area
        1  3   pyjanitor
        2  4  grapefruit
        >>> df.transform_column(
        ...     column_name="a",
        ...     function=lambda x: x**2 - 1,
        ... )
            a           b
        0   3        area
        1   8   pyjanitor
        2  15  grapefruit

    Examples:
        Transform a column in-place with a column-wise function.

        >>> df.transform_column(
        ...     column_name="b",
        ...     function=lambda srs: srs.str[:5],
        ...     elementwise=False,
        ... )
           a      b
        0  2   area
        1  3  pyjan
        2  4  grape

    Args:
        df: A pandas DataFrame.
        column_name: The column to transform.
        function: A function to apply on the column.
        dest_column_name: The column name to store the transformation result
            in. Defaults to None, which will result in the original column
            name being overwritten. If a name is provided here, then a new
            column with the transformed values will be created.
        elementwise: Whether to apply the function elementwise or not.
            If `elementwise` is True, then the function's first argument
            should be the data type of each datum in the column of data,
            and should return a transformed datum.
            If `elementwise` is False, then the function should expect
            a pandas Series passed into it, and return a pandas Series.

    Returns:
        A pandas DataFrame with a transformed column.
    """
    check_column(df, column_name)

    if dest_column_name is None:
        dest_column_name = column_name
    elif dest_column_name != column_name:
        # If `dest_column_name` is provided and equals `column_name`, then we
        # assume that the user's intent is to perform an in-place
        # transformation (Same behaviour as when `dest_column_name` = None).
        # Otherwise we throw an error if `dest_column_name` already exists in
        # df.
        check_column(df, dest_column_name, present=False)

    result = _get_transform_column_result(
        df[column_name],
        function,
        elementwise,
    )

    return df.assign(**{dest_column_name: result})

transform_columns(df, column_names, function, suffix=None, elementwise=True, new_column_names=None)

Transform multiple columns through the same transformation.

This method does not mutate the original DataFrame.

Super syntactic sugar! Essentially wraps transform_column and calls it repeatedly over all column names provided.

User can optionally supply either a suffix to create a new set of columns with the specified suffix, or provide a dictionary mapping each original column name in column_names to its corresponding new column name. Note that all column names must be strings.

Examples:

log10 transform a list of columns, replacing original columns.

>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "col1": [5, 10, 15],
...     "col2": [3, 6, 9],
...     "col3": [10, 100, 1_000],
... })
>>> df
   col1  col2  col3
0     5     3    10
1    10     6   100
2    15     9  1000
>>> df.transform_columns(["col1", "col2", "col3"], np.log10)
       col1      col2  col3
0  0.698970  0.477121   1.0
1  1.000000  0.778151   2.0
2  1.176091  0.954243   3.0

Using the suffix parameter to create new columns.

>>> df.transform_columns(["col1", "col3"], np.log10, suffix="_log")
   col1  col2  col3  col1_log  col3_log
0     5     3    10  0.698970       1.0
1    10     6   100  1.000000       2.0
2    15     9  1000  1.176091       3.0

Using the new_column_names parameter to create new columns.

>>> df.transform_columns(
...     ["col1", "col3"],
...     np.log10,
...     new_column_names={"col1": "transform1"},
... )
   col1  col2  col3  transform1
0     5     3   1.0    0.698970
1    10     6   2.0    1.000000
2    15     9   3.0    1.176091
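
With `elementwise=False`, each whole column is passed to the function as a pandas Series; a sketch scaling each selected column by its maximum:

```python
df.transform_columns(
    ["col1", "col2"],
    lambda s: s / s.max(),
    elementwise=False,
)
#        col1      col2  col3
# 0  0.333333  0.333333    10
# 1  0.666667  0.666667   100
# 2  1.000000  1.000000  1000
```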

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[List[str], Tuple[str]]

An iterable of columns to transform.

required
function Callable

A function to apply on each column.

required
suffix Optional[str]

Suffix to use when creating new columns to hold the transformed values.

None
elementwise bool

Passed on to transform_column; whether or not to apply the transformation function elementwise (True) or columnwise (False).

True
new_column_names Optional[Dict[str, str]]

An explicit mapping of old column names in column_names to new column names. If any column specified in column_names is not a key in this dictionary, the transformation will happen in-place for that column.

None

Raises:

Type Description
ValueError

If both suffix and new_column_names are specified.

Returns:

Type Description
DataFrame

A pandas DataFrame with transformed columns.

Source code in janitor/functions/transform_columns.py
Lines 125-239
@pf.register_dataframe_method
@deprecated_alias(columns="column_names", new_names="new_column_names")
def transform_columns(
    df: pd.DataFrame,
    column_names: Union[List[str], Tuple[str]],
    function: Callable,
    suffix: Optional[str] = None,
    elementwise: bool = True,
    new_column_names: Optional[Dict[str, str]] = None,
) -> pd.DataFrame:
    """Transform multiple columns through the same transformation.

    This method does not mutate the original DataFrame.

    Super syntactic sugar!
    Essentially wraps [`transform_column`][janitor.functions.transform_columns.transform_column]
    and calls it repeatedly over all column names provided.

    User can optionally supply either a suffix to create a new set of columns
    with the specified suffix, or provide a dictionary mapping each original
    column name in `column_names` to its corresponding new column name.
    Note that all column names must be strings.

    Examples:
        log10 transform a list of columns, replacing original columns.

        >>> import numpy as np
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "col1": [5, 10, 15],
        ...     "col2": [3, 6, 9],
        ...     "col3": [10, 100, 1_000],
        ... })
        >>> df
           col1  col2  col3
        0     5     3    10
        1    10     6   100
        2    15     9  1000
        >>> df.transform_columns(["col1", "col2", "col3"], np.log10)
               col1      col2  col3
        0  0.698970  0.477121   1.0
        1  1.000000  0.778151   2.0
        2  1.176091  0.954243   3.0

        Using the `suffix` parameter to create new columns.

        >>> df.transform_columns(["col1", "col3"], np.log10, suffix="_log")
           col1  col2  col3  col1_log  col3_log
        0     5     3    10  0.698970       1.0
        1    10     6   100  1.000000       2.0
        2    15     9  1000  1.176091       3.0

        Using the `new_column_names` parameter to create new columns.

        >>> df.transform_columns(
        ...     ["col1", "col3"],
        ...     np.log10,
        ...     new_column_names={"col1": "transform1"},
        ... )
           col1  col2  col3  transform1
        0     5     3   1.0    0.698970
        1    10     6   2.0    1.000000
        2    15     9   3.0    1.176091

    Args:
        df: A pandas DataFrame.
        column_names: An iterable of columns to transform.
        function: A function to apply on each column.
        suffix: Suffix to use when creating new columns to hold
            the transformed values.
        elementwise: Passed on to [`transform_column`][janitor.functions.transform_columns.transform_column]; whether or not
            to apply the transformation function elementwise (True)
            or columnwise (False).
        new_column_names: An explicit mapping of old column names in
            `column_names` to new column names. If any column specified in
            `column_names` is not a key in this dictionary, the transformation
            will happen in-place for that column.

    Raises:
        ValueError: If both `suffix` and `new_column_names` are specified.

    Returns:
        A pandas DataFrame with transformed columns.
    """  # noqa: E501
    check("column_names", column_names, [list, tuple])
    check_column(df, column_names)

    if suffix is not None and new_column_names is not None:
        raise ValueError(
            "Only one of `suffix` or `new_column_names` should be specified."
        )

    if suffix:
        check("suffix", suffix, [str])
        dest_column_names = {col: col + suffix for col in column_names}
    elif new_column_names:
        check("new_column_names", new_column_names, [dict])
        dest_column_names = {
            col: new_column_names.get(col, col) for col in column_names
        }
    else:
        dest_column_names = dict(zip(column_names, column_names))

    results = {}
    for old_col, new_col in dest_column_names.items():
        if old_col != new_col:
            check_column(df, new_col, present=False)
        results[new_col] = _get_transform_column_result(
            df[old_col],
            function,
            elementwise=elementwise,
        )

    return df.assign(**results)

truncate_datetime

Implementation of the truncate_datetime family of functions.

truncate_datetime_dataframe(df, datepart)

Truncate times down to a user-specified precision of year, month, day, hour, minute, or second.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
...     "foo": ["xxxx", "yyyy", "zzzz"],
...     "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
... })
>>> df
    foo                  dt
0  xxxx 2020-03-11 00:00:00
1  yyyy 2020-03-11 15:00:00
2  zzzz 2020-03-12 06:00:00
>>> df.truncate_datetime_dataframe("day")
    foo         dt
0  xxxx 2020-03-11
1  yyyy 2020-03-11
2  zzzz 2020-03-12
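
Any of the listed precisions works the same way; for example, truncating the same frame to month precision:

```python
df.truncate_datetime_dataframe("month")
#     foo         dt
# 0  xxxx 2020-03-01
# 1  yyyy 2020-03-01
# 2  zzzz 2020-03-01
```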

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame on which to truncate datetime.

required
datepart str

Truncation precision, YEAR, MONTH, DAY, HOUR, MINUTE, SECOND. (String is automagically capitalized)

required

Raises:

Type Description
ValueError

If an invalid datepart precision is passed in.

Returns:

Type Description
DataFrame

A pandas DataFrame with all valid datetimes truncated down to the specified precision.

Source code in janitor/functions/truncate_datetime.py
Lines 9-77
@pf.register_dataframe_method
def truncate_datetime_dataframe(
    df: pd.DataFrame,
    datepart: str,
) -> pd.DataFrame:
    """Truncate times down to a user-specified precision of
    year, month, day, hour, minute, or second.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame({
        ...     "foo": ["xxxx", "yyyy", "zzzz"],
        ...     "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
        ... })
        >>> df
            foo                  dt
        0  xxxx 2020-03-11 00:00:00
        1  yyyy 2020-03-11 15:00:00
        2  zzzz 2020-03-12 06:00:00
        >>> df.truncate_datetime_dataframe("day")
            foo         dt
        0  xxxx 2020-03-11
        1  yyyy 2020-03-11
        2  zzzz 2020-03-12

    Args:
        df: The pandas DataFrame on which to truncate datetime.
        datepart: Truncation precision, YEAR, MONTH, DAY,
            HOUR, MINUTE, SECOND. (String is automagically
            capitalized)

    Raises:
        ValueError: If an invalid `datepart` precision is passed in.

    Returns:
        A pandas DataFrame with all valid datetimes truncated down
            to the specified precision.
    """
    # idea from Stack Overflow
    # https://stackoverflow.com/a/28783971/7175713
    # https://numpy.org/doc/stable/reference/arrays.datetime.html
    ACCEPTABLE_DATEPARTS = {
        "YEAR": "datetime64[Y]",
        "MONTH": "datetime64[M]",
        "DAY": "datetime64[D]",
        "HOUR": "datetime64[h]",
        "MINUTE": "datetime64[m]",
        "SECOND": "datetime64[s]",
    }
    datepart = datepart.upper()
    if datepart not in ACCEPTABLE_DATEPARTS:
        raise ValueError(
            "Received an invalid `datepart` precision. "
            f"Please enter any one of {ACCEPTABLE_DATEPARTS}."
        )

    dictionary = {}

    for label, series in df.items():
        if is_datetime64_any_dtype(series):
            dtype = ACCEPTABLE_DATEPARTS[datepart]
            # TODO: add branch for pyarrow arrays
            series = np.array(series._values, dtype=dtype)
        dictionary[label] = series

    return pd.DataFrame(dictionary)

update_where

Function for updating values based on other column values.

update_where(df, conditions, target_column_name, target_val)

Add multiple conditions to update a column in the dataframe.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> data = {
...    "a": [1, 2, 3, 4],
...    "b": [5, 6, 7, 8],
...    "c": [0, 0, 0, 0],
... }
>>> df = pd.DataFrame(data)
>>> df
   a  b  c
0  1  5  0
1  2  6  0
2  3  7  0
3  4  8  0
>>> df.update_where(
...    conditions = (df.a > 2) & (df.b < 8),
...    target_column_name = 'c',
...    target_val = 10
... )
   a  b   c
0  1  5   0
1  2  6   0
2  3  7  10
3  4  8   0
>>> df.update_where( # supports pandas *query* style string expressions
...    conditions = "a > 2 and b < 8",
...    target_column_name = 'c',
...    target_val = 10
... )
   a  b   c
0  1  5   0
1  2  6   0
2  3  7  10
3  4  8   0
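
When `target_column_name` does not yet exist, a new column is created and rows that fail the condition are left null (the column therefore becomes float); a sketch with an illustrative column name:

```python
df.update_where(
    conditions="a > 2 and b < 8",
    target_column_name="d",
    target_val=10,
)
#    a  b  c     d
# 0  1  5  0   NaN
# 1  2  6  0   NaN
# 2  3  7  0  10.0
# 3  4  8  0   NaN
```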

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
conditions Any

Conditions used to update a target column and target value.

required
target_column_name Hashable

Column to be updated. If column does not exist in DataFrame, a new column will be created; note that entries that do not get set in the new column will be null.

required
target_val Any

Value to be updated.

required

Raises:

Type Description
ValueError

If conditions does not return a boolean array-like data structure.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/update_where.py
Lines 12-92
@pf.register_dataframe_method
@deprecated_alias(target_col="target_column_name")
def update_where(
    df: pd.DataFrame,
    conditions: Any,
    target_column_name: Hashable,
    target_val: Any,
) -> pd.DataFrame:
    """Add multiple conditions to update a column in the dataframe.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> data = {
        ...    "a": [1, 2, 3, 4],
        ...    "b": [5, 6, 7, 8],
        ...    "c": [0, 0, 0, 0],
        ... }
        >>> df = pd.DataFrame(data)
        >>> df
           a  b  c
        0  1  5  0
        1  2  6  0
        2  3  7  0
        3  4  8  0
        >>> df.update_where(
        ...    conditions = (df.a > 2) & (df.b < 8),
        ...    target_column_name = 'c',
        ...    target_val = 10
        ... )
           a  b   c
        0  1  5   0
        1  2  6   0
        2  3  7  10
        3  4  8   0
        >>> df.update_where( # supports pandas *query* style string expressions
        ...    conditions = "a > 2 and b < 8",
        ...    target_column_name = 'c',
        ...    target_val = 10
        ... )
           a  b   c
        0  1  5   0
        1  2  6   0
        2  3  7  10
        3  4  8   0

    Args:
        df: The pandas DataFrame object.
        conditions: Conditions used to update a target column
            and target value.
        target_column_name: Column to be updated. If column does not exist
            in DataFrame, a new column will be created; note that entries
            that do not get set in the new column will be null.
        target_val: Value to be updated.

    Raises:
        ValueError: If `conditions` does not return a boolean array-like
            data structure.

    Returns:
        A pandas DataFrame.
    """

    df = df.copy()

    # use query mode if a string expression is passed
    if isinstance(conditions, str):
        conditions = df.eval(conditions)

    if not is_bool_dtype(conditions):
        raise ValueError(
            """
            Kindly ensure that `conditions` passed
            evaluates to a Boolean dtype.
            """
        )

    df.loc[conditions, target_column_name] = target_val

    return df

utils

Utility functions for all of the functions submodule.

DropLabel dataclass

Helper class for removing labels within the select syntax.

label can be any of the types supported in the select, select_rows and select_columns functions. An array of integers not matching the labels is returned.

New in version 0.24.0

Parameters:

Name Type Description Default
label Any

Label(s) to be dropped from the index.

required
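
A minimal sketch of `DropLabel` inside the select syntax (the frame here is illustrative):

```python
import pandas as pd
import janitor  # noqa: F401
from janitor.functions.utils import DropLabel

df = pd.DataFrame({"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]})
# Keep everything except the "foo" column.
df.select_columns(DropLabel("foo"))
#    col1  col2
# 0     1     5
# 1     2     6
```
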
Source code in janitor/functions/utils.py
Lines 298-312
@dataclass
class DropLabel:
    """Helper class for removing labels within the `select` syntax.

    `label` can be any of the types supported in the `select`,
    `select_rows` and `select_columns` functions.
    An array of integers not matching the labels is returned.

    !!! info "New in version 0.24.0"

    Args:
        label: Label(s) to be dropped from the index.
    """

    label: Any

col

Helper class for column selection within an expression.

Parameters:

Name Type Description Default
column Hashable

The name of the column to be selected.

required

Raises:

Type Description
TypeError

If the column parameter is not hashable.

New in version 0.25.0

Warning

col is currently considered experimental. The implementation and parts of the API may change without warning.
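
A sketch of what a comparison between two `col` objects records; the resulting `join_args` tuple is what the expression carries into the join machinery (the column names here are illustrative):

```python
from janitor.functions.utils import col

expr = col("start_date") <= col("end_date")
expr.join_args
# ('start_date', 'end_date', '<=')
```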

Source code in janitor/functions/utils.py
Lines 1041-1135
class col:
    """Helper class for column selection within an expression.

    Args:
        column (Hashable): The name of the column to be selected.

    Raises:
        TypeError: If the `column` parameter is not hashable.

    !!! info "New in version 0.25.0"

    !!! warning

        `col` is currently considered experimental.
        The implementation and parts of the API
        may change without warning.

    """

    def __init__(self, column: Hashable):
        self.cols = column
        check("column", self.cols, [Hashable])
        self.join_args = None

    def __gt__(self, other):
        """Implements the greater-than comparison operator (`>`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, ">")
        return self

    def __ge__(self, other):
        """Implements the greater-than-or-equal-to comparison operator (`>=`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, ">=")
        return self

    def __lt__(self, other):
        """Implements the less-than comparison operator (`<`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "<")
        return self

    def __le__(self, other):
        """Implements the less-than-or-equal-to comparison operator (`<=`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "<=")
        return self

    def __ne__(self, other):
        """Implements the not-equal-to comparison operator (`!=`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "!=")
        return self

    def __eq__(self, other):
        """Implements the equal-to comparison operator (`==`).

        Args:
            other (col): The other `col` object to compare to.

        Returns:
            col: The current `col` object.
        """
        self.join_args = (self.cols, other.cols, "==")
        return self

__eq__(other)

Implements the equal-to comparison operator (==).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
Lines 1125-1135
def __eq__(self, other):
    """Implements the equal-to comparison operator (`==`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "==")
    return self

__ge__(other)

Implements the greater-than-or-equal-to comparison operator (>=).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
Lines 1077-1087
def __ge__(self, other):
    """Implements the greater-than-or-equal-to comparison operator (`>=`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, ">=")
    return self

__gt__(other)

Implements the greater-than comparison operator (>).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
Lines 1065-1075
def __gt__(self, other):
    """Implements the greater-than comparison operator (`>`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, ">")
    return self

__le__(other)

Implements the less-than-or-equal-to comparison operator (<=).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
Lines 1101-1111
def __le__(self, other):
    """Implements the less-than-or-equal-to comparison operator (`<=`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "<=")
    return self

__lt__(other)

Implements the less-than comparison operator (<).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
Lines 1089-1099
def __lt__(self, other):
    """Implements the less-than comparison operator (`<`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "<")
    return self

__ne__(other)

Implements the not-equal-to comparison operator (!=).

Parameters:

Name Type Description Default
other col

The other col object to compare to.

required

Returns:

Name Type Description
col

The current col object.

Source code in janitor/functions/utils.py
Lines 1113-1123
def __ne__(self, other):
    """Implements the not-equal-to comparison operator (`!=`).

    Args:
        other (col): The other `col` object to compare to.

    Returns:
        col: The current `col` object.
    """
    self.join_args = (self.cols, other.cols, "!=")
    return self

get_columns(group, label)

Helper function for selecting columns on a grouped object, using the select syntax.

New in version 0.25.0

Parameters:

Name Type Description Default
group Union[DataFrameGroupBy, SeriesGroupBy]

A Pandas GroupBy object.

required
label

column(s) to select.

required

Returns:

Type Description

A pandas groupby object.
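
A small sketch of select-style column selection on a GroupBy object (the frame is illustrative):

```python
import pandas as pd
from janitor.functions.utils import get_columns

df = pd.DataFrame({"grp": ["a", "a", "b"], "col1": [1, 2, 3], "col2": [4, 5, 6]})
grouped = df.groupby("grp")
# Glob-select the "col*" columns from the grouped object, then aggregate.
get_columns(grouped, "col*").sum()
#      col1  col2
# grp
# a       3     9
# b       3     6
```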

Source code in janitor/functions/utils.py
Lines 624-642
def get_columns(group: Union[DataFrameGroupBy, SeriesGroupBy], label):
    """
    Helper function for selecting columns on a grouped object,
    using the
    [`select`][janitor.functions.select.select] syntax.

    !!! info "New in version 0.25.0"

    Args:
        group: A Pandas GroupBy object.
        label: column(s) to select.

    Returns:
        A pandas groupby object.
    """
    check("groupby object", group, [DataFrameGroupBy, SeriesGroupBy])
    label = get_index_labels(label, group.obj, axis="columns")
    label = label if is_scalar(label) else list(label)
    return group[label]

get_index_labels(arg, df, axis)

Convenience function to get actual labels from column/index

New in version 0.25.0

Parameters:

Name Type Description Default
arg

Valid inputs include: an exact column name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

required
df DataFrame

The pandas DataFrame object.

required
axis Literal['index', 'columns']

Should be either index or columns.

required

Returns:

Type Description
Index

A pandas Index.
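
A sketch of resolving a selection argument into concrete labels (the frame is illustrative):

```python
import pandas as pd
from janitor.functions.utils import get_index_labels

df = pd.DataFrame({"col1": [1], "foo": [2], "col2": [3]})
get_index_labels("col*", df, axis="columns")
# Index(['col1', 'col2'], dtype='object')
```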

Source code in janitor/functions/utils.py
Lines 597-621
def get_index_labels(
    arg, df: pd.DataFrame, axis: Literal["index", "columns"]
) -> pd.Index:
    """Convenience function to get actual labels from column/index

    !!! info "New in version 0.25.0"

    Args:
        arg: Valid inputs include: an exact column name to look for,
            a shell-style glob string (e.g. `*_thing_*`),
            a regular expression,
            a callable,
            or variable arguments of all the aforementioned.
            A sequence of booleans is also acceptable.
            A dictionary can be used for selection
            on a MultiIndex on different levels.
        df: The pandas DataFrame object.
        axis: Should be either `index` or `columns`.

    Returns:
        A pandas Index.
    """
    assert axis in {"index", "columns"}
    index = getattr(df, axis)
    return index[_select_index(arg, df, axis)]

patterns(regex_pattern)

This function converts a string into a compiled regular expression.

It can be used to select columns in the index or column_names arguments of the pivot_longer function.

Warning

This function is deprecated. Kindly use re.compile instead.

Parameters:

Name Type Description Default
regex_pattern Union[str, Pattern]

String to be converted to compiled regular expression.

required

Returns:

Type Description
Pattern

A compiled regular expression from the provided regex_pattern.

Source code in janitor/functions/utils.py
Lines 144-168
def patterns(regex_pattern: Union[str, Pattern]) -> Pattern:
    """This function converts a string into a compiled regular expression.

    It can be used to select columns in the index or column_names
    arguments of the `pivot_longer` function.

    !!!warning

        This function is deprecated. Kindly use `re.compile` instead.

    Args:
        regex_pattern: String to be converted to compiled regular
            expression.

    Returns:
        A compiled regular expression from the provided `regex_pattern`.
    """
    warnings.warn(
        "This function is deprecated. Kindly use `re.compile` instead.",
        DeprecationWarning,
        stacklevel=find_stack_level(),
    )
    check("regular expression", regex_pattern, [str, Pattern])

    return re.compile(regex_pattern)

unionize_dataframe_categories(*dataframes, column_names=None)

Given a group of dataframes which contain some categorical columns, for each categorical column present, find all the possible categories across all the dataframes which have that column. Update each dataframe's corresponding column with a new categorical object that contains the original data but has labels for all the possible categories from all dataframes. This is useful when concatenating a list of dataframes which all have the same categorical columns into one dataframe.

If, for a given categorical column, the input dataframes do not all contain at least one instance of every possible category, pandas will change the output dtype of that column from category to object when they are concatenated, losing out on the dramatic speed gains you get from the former format.

Examples:

Usage example for concatenation of categorical column-containing dataframes:

Instead of:

concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)

which in your case has resulted in category -> object conversion, use:

unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
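
A compact, self-contained sketch of the category-preserving concatenation described above (the column and frame names are illustrative):

```python
import pandas as pd
from janitor.functions.utils import unionize_dataframe_categories

df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "pear"])})
df2 = pd.DataFrame({"fruit": pd.Categorical(["banana"])})

fixed = unionize_dataframe_categories(df1, df2)
combined = pd.concat(fixed, ignore_index=True)
# combined["fruit"] stays categorical, with all three categories,
# instead of being downcast to object.
```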

Parameters:

Name Type Description Default
*dataframes Any

The dataframes you wish to unionize the categorical objects for.

()
column_names Optional[Iterable[CategoricalDtype]]

If supplied, only unionize this subset of columns.

None

Raises:

Type Description
TypeError

If any of the inputs are not pandas DataFrames.

Returns:

Type Description
List[DataFrame]

A list of the category-unioned dataframes in the same order they were provided.

Source code in janitor/functions/utils.py
Lines 44-141
def unionize_dataframe_categories(
    *dataframes: Any,
    column_names: Optional[Iterable[pd.CategoricalDtype]] = None,
) -> List[pd.DataFrame]:
    """
    Given a group of dataframes which contain some categorical columns, for
    each categorical column present, find all the possible categories across
    all the dataframes which have that column.
    Update each dataframe's corresponding column with a new categorical object
    that contains the original data
    but has labels for all the possible categories from all dataframes.
    This is useful when concatenating a list of dataframes which all have the
    same categorical columns into one dataframe.

    If, for a given categorical column, all input dataframes do not have at
    least one instance of all the possible categories,
    Pandas will change the output dtype of that column from `category` to
    `object`, losing out on dramatic speed gains you get from the former
    format.

    Examples:
        Usage example for concatenation of categorical column-containing
        dataframes:

        Instead of:

        ```python
        concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)
        ```

        which in your case has resulted in `category` -> `object` conversion,
        use:

        ```python
        unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
        concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
        ```

    Args:
        *dataframes: The dataframes you wish to unionize the categorical
            objects for.
        column_names: If supplied, only unionize this subset of columns.

    Raises:
        TypeError: If any of the inputs are not pandas DataFrames.

    Returns:
        A list of the category-unioned dataframes in the same order they
            were provided.
    """

    if any(not isinstance(df, pd.DataFrame) for df in dataframes):
        raise TypeError("Inputs must all be dataframes.")

    if column_names is None:
        # Find all columns across all dataframes that are categorical

        column_names = set()

        for dataframe in dataframes:
            column_names = column_names.union(
                [
                    column_name
                    for column_name in dataframe.columns
                    if isinstance(
                        dataframe[column_name].dtype, pd.CategoricalDtype
                    )
                ]
            )

    else:
        column_names = [column_names]
    # For each categorical column, find all possible values across the DFs

    category_unions = {
        column_name: union_categoricals(
            [df[column_name] for df in dataframes if column_name in df.columns]
        )
        for column_name in column_names
    }

    # Make a shallow copy of all DFs and modify the categorical columns
    # such that they can encode the union of all possible categories for each.

    refactored_dfs = []

    for df in dataframes:
        df = df.copy(deep=False)

        for column_name, categorical in category_unions.items():
            if column_name in df.columns:
                df[column_name] = pd.Categorical(
                    df[column_name], categories=categorical.categories
                )

        refactored_dfs.append(df)

    return refactored_dfs