DataFrame accessors

Basic manipulation

Type casting

cast.__call__(**dtypes) → pandas.core.frame.DataFrame

Convert dtypes of multiple columns using a dictionary

Parameters: dtypes (dict) – Column name to data type mapping

Notes

You can also specify “index” and “datetime” on a column. Note that pandas cannot convert a column containing NaNs to a NumPy integer type. Such a column is converted to float automatically, and a warning is issued to the user. See https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html
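
The NaN limitation described in this note can be reproduced with plain pandas. The sketch below is illustrative only and does not call pandas_lightning: it shows the integer cast failing, the float fallback, and pandas' nullable "Int64" extension dtype as a possible alternative.

```python
import numpy as np
import pandas as pd

# A numeric column with a missing value is stored as float64.
s = pd.Series([1, 2, np.nan])

# Casting to the NumPy-backed int type fails because NaN has no integer form.
try:
    s.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Falling back to float keeps the NaN.
as_float = s.astype(float)

# The nullable extension dtype "Int64" stores the gap as <NA> instead.
as_nullable = s.astype("Int64")
```

Whether `cast` could use "Int64" instead of float is not stated in the source; the float fallback is the documented behaviour.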

Example

Suppose we have a dataframe

>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> import pandas_lightning
>>> df = pd.DataFrame({
...     "X": list("ABACBB"),
...     "Y": list("121092"),
...     "Z": ["hot","warm","hot","cold","cold","hot"]
... })
>>> df
   X  Y     Z
0  A  1   hot
1  B  2  warm
2  A  1   hot
3  C  0  cold
4  B  9  cold
5  B  2   hot

Change the types of the columns by writing

>>> df = df.cast(
...     X="category",  # this will be nominal
...     Y=int,
...     Z=["cold", "warm", "hot"],  # this will be ordinal
... )

which is equivalent to

>>> df["X"] = df["X"].astype("category")
>>> df["Y"] = df["Y"].astype(int)
>>> df["Z"] = df["Z"].astype(CategoricalDtype(
...                 ["cold", "warm", "hot"], ordered=True))

Returns: A dataframe whose columns have been converted accordingly
Return type: pandas.DataFrame
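
To illustrate why the nominal/ordinal distinction matters, the following plain-pandas sketch builds the same ordered categorical as the Z column in the example and shows that ordered categories support comparisons and order-aware sorting. It uses only standard pandas calls, not the `cast` accessor.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

temperature = CategoricalDtype(["cold", "warm", "hot"], ordered=True)
z = pd.Series(["hot", "warm", "hot", "cold", "cold", "hot"]).astype(temperature)

# Comparisons follow the declared category order, not alphabetical order.
warmer_than_cold = z > "cold"

# Sorting and min/max also respect the declared order.
sorted_z = z.sort_values()
coldest, hottest = z.min(), z.max()
```

A nominal column (plain "category") would raise a TypeError on the same comparison, because its categories have no order.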

Column transformations

add_columns.__call__(**lambdas) → pandas.core.frame.DataFrame

Apply one or more lambda operations to the columns (series) of a dataframe

Parameters:
  • lambdas (dict) – A dictionary of column name to lambda mapping
  • inplace (bool, optional) – Whether to modify the series inplace, by default False

Examples

>>> import pandas as pd
>>> import pandas_lightning
>>> df = pd.DataFrame({"X": list("ABACBB"),
...                    "Y": list("121092"),
...                    "Z": ["hot","warm","hot","cold","cold","hot"]
... })
>>> df = df.cast(
...     Y=int,
...     Z=["cold", "warm", "hot"])

Example 1

Rewriting to the same column

>>> df = df.add_columns(
...     X=lambda s: s + "rea",
...     Y=lambda s: s+100,
...     Z=lambda s: s.str.upper())

which is the same as

>>> df["X"] = df["X"] + "rea"
>>> df["Y"] = df["Y"] + 100
>>> df["Z"] = df["Z"].str.upper()

Example 2

Writing the result to another column

>>> df = df.add_columns(
...     X_new=("X", lambda s: s + "rea"),
...     Y_new=("Y", lambda s: s+100),
...     Z_new=("Z", lambda s: s.str.upper())
... )

which is the same as

>>> df["X_new"] = df["X"] + "rea"
>>> df["Y_new"] = df["Y"] + 100
>>> df["Z_new"] = df["Z"].str.upper()

Example 3

Work with more than 1 column at a time

>>> df.add_columns(
...     XY=(["X", "Y"],
...         lambda x, y: x + (y+10).astype(str)),
...     YZ=(["Y", "Z"],
...         lambda y, z: z.astype(str) + "-" + y.astype(str)),
... )

which is the same as

>>> df["XY"] = df["X"] + (df["Y"] + 10).astype(str)
>>> df["YZ"] = df["Z"].astype(str) + "-" + df["Y"].astype(str)

Returns:

A transformed copy of the dataframe

Return type:

pandas.DataFrame

Raises:
  • ValueError – If lambdas is not a dict
  • ValueError – [description]
  • ValueError – [description]
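
For comparison, a similar non-mutating pattern is available in plain pandas through `DataFrame.assign`. This is a sketch of the standard pandas idiom, not of how `add_columns` is implemented; note that `assign` passes the whole dataframe to each callable rather than a single series.

```python
import pandas as pd

df = pd.DataFrame({"X": list("ABACBB"), "Y": [1, 2, 1, 0, 9, 2]})

# assign() returns a new dataframe; each callable receives the whole frame.
out = df.assign(
    Y_new=lambda d: d["Y"] + 100,
    XY=lambda d: d["X"] + (d["Y"] + 10).astype(str),
)
```

The original `df` is left untouched, which matches the "transformed copy" return value described above.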

drop_if_exist

lambdas.drop_if_exist(columns: list)
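
The source gives no description for this method. Judging by its name, it presumably drops the listed columns and silently skips any that are absent; that reading is an assumption. In plain pandas the same effect can be sketched with `errors="ignore"`:

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})

# errors="ignore" skips labels that are not present instead of raising KeyError.
out = df.drop(columns=["Y", "missing"], errors="ignore")
```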

drop_columns_with_rules

lambdas.drop_columns_with_rules(*functions)

Drop a column if any of the conditions defined in the functions or lambdas are met

Parameters: functions (functions or lambdas) – Functions or lambdas that take a series as a parameter and return a bool

Examples

>>> import pandas as pd
>>> import pandas_lightning
>>> import numpy as np
>>> df = pd.DataFrame({"X": [np.nan, np.nan, np.nan, np.nan, "hey"],
...                    "Y": [0, np.nan, 0, 0, 1],
...                    "Z": [1, 9, 5, 4, 2]})

One of the more common patterns is dropping a column whose share of missing or zero values exceeds a certain threshold.

>>> df.lambdas.drop_columns_with_rules(
...     lambda s: s.pctg.nans > 0.75,
...     lambda s: s.pctg.zeros > 0.5)
   Z
0  1
1  9
2  5
3  4
4  2

Returns: Dataframe with dropped columns
Return type: pandas.DataFrame
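
The rule-based drop can be approximated in plain pandas. In the sketch below, `drop_columns_matching` is a hypothetical helper, and `isna().mean()` and `(s == 0).mean()` stand in for the `pctg.nans` and `pctg.zeros` accessors; whether the library computes those percentages in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd


def drop_columns_matching(df, *rules):
    """Return a copy of df without the columns for which any rule is True."""
    to_drop = [
        name for name in df.columns
        if any(bool(rule(df[name])) for rule in rules)
    ]
    return df.drop(columns=to_drop)


df = pd.DataFrame({"X": [np.nan, np.nan, np.nan, np.nan, "hey"],
                   "Y": [0, np.nan, 0, 0, 1],
                   "Z": [1, 9, 5, 4, 2]})

out = drop_columns_matching(
    df,
    lambda s: s.isna().mean() > 0.75,   # more than 75% missing
    lambda s: (s == 0).mean() > 0.5,    # more than 50% zeros
)
```

Here X is dropped because 4 of 5 values are missing, and Y because 3 of 5 values are zero.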

fillna

lambdas.fillna(**d)

Example

>>> df.lambdas.fillna(
...     Sex=lambda sex: sex.median(),
...     Age=(["Sex", "Pclass"], lambda group: group.median())
... )
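
The source does not explain the two argument forms. One plausible reading, stated here as an assumption, is that a bare lambda computes a fill value from the column itself, while the tuple form computes it per group of the listed columns. Under that reading, a plain-pandas sketch on a small made-up dataframe looks like this:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [20, np.nan, 40, 10, 30, np.nan],
    "Fare": [10, 20, np.nan, 5, 5, 8],
})

# Per-group median of Age, aligned to the original rows.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

filled = df.assign(
    Age=df["Age"].fillna(group_median),          # group-wise fill
    Fare=df["Fare"].fillna(df["Fare"].median()),  # whole-column fill
)
```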

setna

lambdas.setna(**conditions)

You would do this on an existing column

Returns: A copy of the dataframe
Return type: pandas.DataFrame
Raises: ValueError – If the wrong type is specified
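
No example is given for this method. Assuming it replaces values with NaN wherever a per-column condition holds (an interpretation of the name and signature, not something the source confirms), the plain-pandas counterpart would be `Series.mask`:

```python
import pandas as pd

# Made-up data: -1 and 999 are placeholder codes for "unknown".
df = pd.DataFrame({"Age": [22, -1, 35, 999]})

# Series.mask replaces values where the condition is True (with NaN by default).
out = df.assign(Age=df["Age"].mask((df["Age"] < 0) | (df["Age"] > 120)))
```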

map_conditional

lambdas.map_conditional(**mappings)

Map values from multiple columns based on conditional statements expressed as lambdas. Similar to numpy.select and numpy.where.

You would use this for if-elif-else statements. If your statement is only an if-else and it is short, you would want to use sapply instead.

Similar to pandas.Series.map.

Parameters:
  • lambdas (dict) – A nested dictionary where each key is a column name and each value is a dictionary of (new value: conditional statement) pairs, with the conditional statement expressed as a lambda
  • inplace (bool, optional) – Whether to modify the series inplace, by default False

Example

>>> df = pd.DataFrame({"X": [0, 5, 3, 3, 4, 1],
...                    "Y": [1, 2, 1, 0, 9, 2],
...                    "Z": ["hot","warm","hot","cold","cold","hot"]
... })

Here is an example that maps the values across two series and creates a new series W.

>>> df.lambdas.map_conditional(
...     W=(["X", "Y"], {
...         "green": lambda x, y: x + y > 1,
...         "orange": lambda x, y: x == 5,
...        }, "black")
... )
   X  Y     Z       W
0  0  1   hot   black
1  5  2  warm  orange
2  3  1   hot   green
3  3  0  cold   green
4  4  9  cold   green
5  1  2   hot   green

Here is an example that changes the value of Z. Note that this is a contrived example and you would want to use the pandas.Series.map API instead.

>>> df.lambdas.map_conditional(
...     Z={"blue": lambda z: z == "cold",
...        "amber": lambda z: z == "warm",
...        "red": lambda z: z == "hot"})

Returns: A transformed copy of the dataframe
Return type: pandas.DataFrame
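
Since the method is described as similar to numpy.select, here is a sketch of the same kind of if-elif-else mapping written directly with `numpy.select`. Note that `numpy.select` takes the first condition that matches, so the order of the condition list matters; how `map_conditional` resolves overlapping conditions is not stated in the source.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [0, 5, 3, 3, 4, 1],
                   "Y": [1, 2, 1, 0, 9, 2]})

# Conditions are checked in order; the more specific rule is listed first.
conditions = [
    (df["X"] == 5).to_numpy(),
    (df["X"] + df["Y"] > 1).to_numpy(),
]
choices = ["orange", "green"]

# Rows that match no condition get the default value.
out = df.assign(W=np.select(conditions, choices, default="black"))
```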

Charts

options

quickplot.options()

Call this method to see which graphs you can plot

>>> df.quickplot(numerical=["age"], categorical=["year"]).options()

barplot

quickplot.barplot(**kwargs)

heatmap

quickplot.heatmap(**kwargs)

kdeplot

quickplot.kdeplot(**kwargs)

distplot

countplot

quickplot.countplot(**kwargs)

scatterplot

quickplot.scatterplot(**kwargs)

lineplot

quickplot.lineplot(**kwargs)

hexbinplot

quickplot.hexbinplot(**kwargs)

boxplot

quickplot.boxplot(**kwargs)

violinplot

quickplot.violinplot(**kwargs)

stripplot

quickplot.stripplot(**kwargs)

qqplot

quickplot.qqplot(**kwargs)

catplot

quickplot.catplot(**kwargs)

ridgeplot

quickplot.ridgeplot()

Tests

Basic info

tests.info(pctg=True, mode='df')

TODO put this method elsewhere

Tests for categorical data

tests.categorical(alpha=0.05)

Tests for numerical data

tests.numerical()

Cramér's V test

tests.get_cramersv()
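
For reference, Cramér's V measures the association between two categorical variables as V = sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-squared statistic of the r-by-c contingency table and n is the number of observations. The sketch below computes the uncorrected statistic with pandas and NumPy; `cramers_v` is a hypothetical helper, and whether `get_cramersv` applies a bias correction is not stated in the source.

```python
import numpy as np
import pandas as pd


def cramers_v(a, b):
    """Uncorrected Cramér's V for two categorical series.

    Assumes each series has at least two distinct values.
    """
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


a = pd.Series(["x", "x", "y", "y"])
perfectly_associated = cramers_v(a, pd.Series(["p", "p", "q", "q"]))
independent = cramers_v(a, pd.Series(["p", "q", "p", "q"]))
```

The value ranges from 0 (no association) to 1 (one variable fully determines the other).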

Preparing for modelling

to_X_y

to_X_y.__call__(*, target: str, nominal: str, nominal_max_cardinality: int, nans: str) → pandas.core.frame.DataFrame

Change everything to a numeric type

Parameters:
  • target (str) – Column name of target variable (for regression or classification)
  • nominal (str) – Strategy to deal with nominal columns. One of ‘one-hot’, ‘label’, ‘keep’ or ‘drop’
  • nans (str) – Strategy to deal with missing values. One of ‘remove’ or ‘keep’.
Raises:
  • KeyError – [description]
  • ValueError – [description]
Returns:

A dataframe fit for modelling

Return type:

pandas.DataFrame
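
As a rough illustration of what the ‘one-hot’ and ‘remove’ strategies could amount to, the sketch below prepares a small made-up dataframe with plain pandas. It is an assumption about the intent of these options, not a description of the accessor's implementation.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "size": [1.0, 2.0, 3.0, np.nan],
    "label": [0, 1, 0, 1],
})

# nans="remove": drop every row that has a missing value.
clean = df.dropna()

# Separate the target from the features.
y = clean["label"]

# nominal="one-hot": expand the nominal column into indicator columns.
X = pd.get_dummies(clean.drop(columns="label"), columns=["color"])
```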

undersample

dataset.undersample(col, replace=False, min_count=None, random_state=None)
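
The source does not describe this method. Undersampling usually means sampling every class down to the size of the smallest one, and the parameter names suggest the same here; under that assumption, a plain-pandas sketch is:

```python
import pandas as pd

# Made-up imbalanced data: five rows of class "a", two of class "b".
df = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2, "value": range(7)})

# Size of the smallest class.
min_count = int(df["label"].value_counts().min())

# Draw min_count rows from every class, without replacement.
balanced = df.groupby("label").sample(n=min_count, replace=False, random_state=0)
```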

DataFrame.optimize

drop_duplicate_columns

optimize.drop_duplicate_columns(inplace: bool = False)

Drop duplicate columns that have exactly the same values and datatype

Parameters: inplace (bool, optional) – Whether to perform inplace operation, by default False
Returns: Dataframe with no duplicate columns
Return type: pandas.DataFrame
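
A minimal plain-pandas sketch of the same idea follows. The helper name `drop_equal_columns` is hypothetical; it relies on `Series.equals`, which requires both the values and the dtype to match, so an int column and a float column with the same numbers are both kept.

```python
import pandas as pd


def drop_equal_columns(df):
    """Keep the first of each set of columns with identical values and dtype."""
    keep = []
    for name in df.columns:
        # Series.equals is False when dtypes differ, even if values look equal.
        if not any(df[name].equals(df[kept]) for kept in keep):
            keep.append(name)
    return df[keep]


df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],        # duplicate of "a"
    "c": [1.0, 2.0, 3.0],  # same values, different dtype
    "d": [4, 5, 6],
})
out = drop_equal_columns(df)
```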

convert_categories

optimize.convert_categories(max_cardinality: int = 20, inplace: bool = False)

Convert columns to category whenever possible

Parameters:
  • max_cardinality (int, optional) – The maximum number of unique values a column may have and still be converted to a category type, by default 20
  • inplace (bool, optional) – [description], by default False
Returns:

A transformed dataframe

Return type:

pandas.DataFrame
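
The following sketch shows one way such a conversion could be done in plain pandas. The helper `to_low_cardinality_categories` is hypothetical, and restricting the conversion to string columns is an assumption; the source does not say which column types are considered.

```python
import pandas as pd


def to_low_cardinality_categories(df, max_cardinality=20):
    """Return a copy with low-cardinality string columns cast to category."""
    out = df.copy()
    for name in out.columns:
        is_text = pd.api.types.is_string_dtype(out[name])
        if is_text and out[name].nunique() <= max_cardinality:
            out[name] = out[name].astype("category")
    return out


df = pd.DataFrame({
    "city": ["a", "b", "a", "b"],       # 2 unique values
    "id": ["u1", "u2", "u3", "u4"],     # 4 unique values
    "n": [1, 2, 3, 4],
})
out = to_low_cardinality_categories(df, max_cardinality=2)
```

Category columns typically use less memory than string columns when values repeat often, which is the usual motivation for this kind of conversion.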

profile

optimize.profile(dry_run=True, max_cardinality=20)