DataFrame accessors¶

Basic manipulation¶

Type casting¶

cast.__call__(**dtypes) → pandas.core.frame.DataFrame¶

Convert dtypes of multiple columns using a dictionary

Parameters:	dtypes (dict) – Column name to data type mapping

Notes

You can also specify “index” and “datetime” on a column. Note that pandas cannot convert a column containing NaNs to a NumPy integer type. Such a column is converted to float automatically and a warning is issued. See https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html

Example

Suppose we have a dataframe

>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> import pandas_lightning
>>> df = pd.DataFrame({
...     "X": list("ABACBB"),
...     "Y": list("121092"),
...     "Z": ["hot","warm","hot","cold","cold","hot"]
... })
>>> df
   X  Y     Z
0  A  1   hot
1  B  2  warm
2  A  1   hot
3  C  0  cold
4  B  9  cold
5  B  2   hot

Change the types of the columns by writing

>>> df = df.cast(
...     X="category",  # this will be nominal
...     Y=int,
...     Z=["cold", "warm", "hot"],  # this will be ordinal
... )

which is equivalent to

>>> df["X"] = df["X"].astype("category")
>>> df["Y"] = df["Y"].astype(int)
>>> df["Z"] = df["Z"].astype(CategoricalDtype(
...                 ["cold", "warm", "hot"], ordered=True))

Returns:	A dataframe whose columns have been converted accordingly
Return type:	pandas.DataFrame
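
The NaN fallback described in the notes can be sketched in plain pandas. The helper name `cast_with_nan_fallback` is hypothetical and is not part of pandas_lightning; it only illustrates falling back to float when an integer cast meets missing values.

```python
import warnings

import numpy as np
import pandas as pd


def cast_with_nan_fallback(series: pd.Series, dtype) -> pd.Series:
    """Cast a series, falling back to float when an int cast meets NaNs.

    Hypothetical helper for illustration; not the library's implementation.
    """
    wants_int = pd.api.types.is_integer_dtype(dtype)
    if wants_int and series.isna().any():
        warnings.warn("Column contains NaNs; casting to float instead of int")
        return series.astype(float)
    return series.astype(dtype)


s = pd.Series(["1", "2", np.nan])
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo
    result = cast_with_nan_fallback(s, int)
```

Here `result` holds floats with the missing value preserved, whereas a column without NaNs is cast to int as requested.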

Column transformations¶

add_columns.__call__(**lambdas) → pandas.core.frame.DataFrame¶

Perform multiple lambda operations on series

Parameters:	lambdas (dict) – A dictionary of column name to lambda mapping
	inplace (bool, optional) – Whether to modify the series inplace, by default False

Examples

>>> import pandas as pd
>>> import pandas_lightning
>>> df = pd.DataFrame({"X": list("ABACBB"),
...                    "Y": list("121092"),
...                    "Z": ["hot","warm","hot","cold","cold","hot"]
... })
>>> df = df.cast(
...     Y=int,
...     Z=["cold", "warm", "hot"])

Example 1

Rewriting to the same column

>>> df = df.add_columns(
...     X=lambda s: s + "rea",
...     Y=lambda s: s+100,
...     Z=lambda s: s.str.upper())

which is the same as

>>> df["X"] = df["X"] + "rea"
>>> df["Y"] = df["Y"] + 100
>>> df["Z"] = df["Z"].str.upper()

Example 2

Writing to another column

>>> df = df.add_columns(
...     X_new=("X", lambda s: s + "rea"),
...     Y_new=("Y", lambda s: s+100),
...     Z_new=("Z", lambda s: s.str.upper())
... )

which is the same as

>>> df["X_new"] = df["X"] + "rea"
>>> df["Y_new"] = df["Y"] + 100
>>> df["Z_new"] = df["Z"].str.upper()

Example 3

Work with more than 1 column at a time

>>> df.add_columns(
...     XY=(["X", "Y"],
...         lambda x, y: x + (y+10).astype(str)),
...     YZ=(["Y", "Z"],
...         lambda y, z: z.astype(str) + "-" + y.astype(str)),
... )

which is the same as

>>> df["XY"] = df["X"] + (df["Y"] + 10).astype(str)
>>> df["YZ"] = df["Z"].astype(str) + "-" + df["Y"].astype(str)

Returns:	A transformed copy of the dataframe
Return type:	pandas.DataFrame
Raises:	`ValueError` – If lambdas is not a dict
	`ValueError` – [description]
	`ValueError` – [description]
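
For comparison, a similar result can be obtained with the built-in `DataFrame.assign`. This is a sketch in plain pandas, not the implementation of `add_columns`; note that each `assign` lambda receives the whole dataframe rather than a single series, and the sample data below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "X": list("ABACBB"),
    "Y": [1, 2, 1, 0, 9, 2],
})

# assign() returns a new dataframe and leaves df untouched.
out = df.assign(
    X_new=lambda d: d["X"] + "rea",
    Y_new=lambda d: d["Y"] + 100,
    XY=lambda d: d["X"] + (d["Y"] + 10).astype(str),
)
```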

drop_if_exist¶

lambdas.drop_if_exist(columns: list)¶
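
The method is undocumented. Judging by its name only, it presumably drops the listed columns that are present and ignores the rest. Under that assumption, plain pandas offers the same behaviour through `errors="ignore"`:

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})

# "W" is absent; errors="ignore" skips it instead of raising KeyError.
out = df.drop(columns=["Y", "W"], errors="ignore")
```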

drop_columns_with_rules¶

lambdas.drop_columns_with_rules(*functions)¶

Drop a column if any of the conditions defined in the functions or lambdas are met

Parameters:	functions (functions or lambdas) – Functions or lambdas that take a series as a parameter and return a `bool`

Examples

>>> import pandas as pd
>>> import pandas_lightning
>>> import numpy as np
>>> df = pd.DataFrame({"X": [np.nan, np.nan, np.nan, np.nan, "hey"],
...                    "Y": [0, np.nan, 0, 0, 1],
...                    "Z": [1, 9, 5, 4, 2]})

One of the more common patterns is dropping a column whose proportion of NaNs or zeros exceeds a certain threshold.

>>> df.lambdas.drop_columns_with_rules(
...     lambda s: s.pctg.nans > 0.75,
...     lambda s: s.pctg.zeros > 0.5)
   Z
0  1
1  9
2  5
3  4
4  2

Returns:	Dataframe with dropped columns
Return type:	pandas.DataFrame
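
The same rule-based drop can be sketched without the accessor. The sketch below assumes that `s.pctg.nans` and `s.pctg.zeros` are the fractions of NaN and zero values in a series, and computes them with plain pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [np.nan, np.nan, np.nan, np.nan, "hey"],
                   "Y": [0, np.nan, 0, 0, 1],
                   "Z": [1, 9, 5, 4, 2]})

rules = [
    lambda s: s.isna().mean() > 0.75,  # share of NaNs
    lambda s: (s == 0).mean() > 0.5,   # share of zeros
]

# A column is dropped as soon as any rule returns True for it.
to_drop = [col for col in df.columns
           if any(rule(df[col]) for rule in rules)]
out = df.drop(columns=to_drop)
```

X is dropped by the first rule (4 of 5 values are NaN) and Y by the second (3 of 5 values are zero), leaving only Z.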

fillna¶

lambdas.fillna(**d)¶

Example

>>> df.lambdas.fillna(
...     Sex=lambda sex: sex.median(),
...     Age=(["Sex", "Pclass"], lambda group: group.median())
... )
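
The tuple form appears to fill missing values of one column from a statistic computed within groups of other columns. A sketch of that reading in plain pandas, with made-up sample values, could look like this:

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names follow the example above.
df = pd.DataFrame({
    "Sex": ["m", "m", "f", "f", "f"],
    "Pclass": [1, 1, 2, 2, 2],
    "Age": [20.0, np.nan, 30.0, 40.0, np.nan],
})

# Median age within each (Sex, Pclass) group, aligned to the original rows.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
filled = df.assign(Age=df["Age"].fillna(group_median))
```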

setna¶

lambdas.setna(**conditions)¶

You would do this on an existing column

Returns:	A copy of the dataframe
Return type:	pandas.DataFrame
Raises:	`ValueError` – If wrong type is specified

map_conditional¶

lambdas.map_conditional(**mappings)¶

Map values from multiple columns based on conditional statements expressed as lambdas. Similar to numpy.select and numpy.where.

You would use this for if-elif-else statements. If your statement is only an if-else and it is short, you would want to use sapply instead.

Similar to pandas.Series.map.

Parameters:	lambdas (dict) – A nested dictionary where each key is a column name and each value is a dictionary of {new value: condition}, with each condition expressed as a lambda
	inplace (bool, optional) – Whether to modify the series inplace, by default False

Example

>>> df = pd.DataFrame({"X": [0, 5, 3, 3, 4, 1],
...                    "Y": [1, 2, 1, 0, 9, 2],
...                    "Z": ["hot","warm","hot","cold","cold","hot"]
... })

Here is an example that maps the values across two series and creates a new series W.

>>> df.lambdas.map_conditional(
...     W=(["X", "Y"], {
...         "green": lambda x, y: x + y > 1,
...         "orange": lambda x, y: x == 5,
...        }, "black")
... )
   X  Y     Z       W
0  0  1   hot   black
1  5  2  warm  orange
2  3  1   hot   green
3  3  0  cold   green
4  4  9  cold   green
5  1  2   hot   green

Here is an example that changes the value of Z. Note that this is a contrived example and you would want to use the pandas.Series.map API instead.

>>> df.lambdas.map_conditional(
...     Z={"blue": lambda z: z == "cold",
...        "amber": lambda z: z == "warm",
...        "red": lambda z: z == "hot"})

Returns:	A transformed copy of the dataframe
Return type:	pandas.DataFrame
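
Since the method is described as similar to numpy.select, the recoding of Z can be sketched with `numpy.select` directly. This is a comparison in plain NumPy and pandas, not the library's implementation; the default value "black" is only used for rows that match no condition.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Z": ["hot", "warm", "hot", "cold", "cold", "hot"]})

conditions = [df["Z"] == "cold", df["Z"] == "warm", df["Z"] == "hot"]
choices = ["blue", "amber", "red"]

# numpy.select takes the choice of the first condition that is True;
# rows matching no condition receive the default.
selected = pd.Series(np.select(conditions, choices, default="black"),
                     index=df.index)

# For a one-to-one recoding, Series.map is shorter.
mapped = df["Z"].map({"cold": "blue", "warm": "amber", "hot": "red"})
```

When conditions overlap, `numpy.select` resolves ties in favour of the earliest condition; whether `map_conditional` uses the same precedence is not stated here.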

Charts¶

options¶

quickplot.options()¶

Call this method to see which graphs you can plot

>>> df.quickplot(numerical=["age"], categorical=["year"]).options

barplot¶

quickplot.barplot(**kwargs)¶

heatmap¶

quickplot.heatmap(**kwargs)¶

kdeplot¶

quickplot.kdeplot(**kwargs)¶

distplot¶

countplot¶

quickplot.countplot(**kwargs)¶

scatterplot¶

quickplot.scatterplot(**kwargs)¶

lineplot¶

quickplot.lineplot(**kwargs)¶

hexbinplot¶

quickplot.hexbinplot(**kwargs)¶

boxplot¶

quickplot.boxplot(**kwargs)¶

violinplot¶

quickplot.violinplot(**kwargs)¶

stripplot¶

quickplot.stripplot(**kwargs)¶

qqplot¶

quickplot.qqplot(**kwargs)¶

catplot¶

quickplot.catplot(**kwargs)¶

ridgeplot¶

quickplot.ridgeplot()¶

Tests¶

Basic info¶

tests.info(pctg=True, mode='df')¶

TODO put this method elsewhere

Tests for categorical data¶

tests.categorical(alpha=0.05)¶

Tests for numerical data¶

tests.numerical()¶

Cramers V test¶

tests.get_cramersv()¶
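
For reference, Cramér's V measures the association between two categorical variables on a scale from 0 to 1. Below is a minimal sketch of the uncorrected statistic computed from a contingency table; the function name `cramers_v` is hypothetical, and `get_cramersv` may differ (for example by applying a bias correction).

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1  # zero if a variable has one category
    return float(np.sqrt(chi2 / (n * min_dim)))


x = pd.Series(list("AABB"))
perfect = cramers_v(x, pd.Series(list("uuvv")))      # y determined by x
independent = cramers_v(x, pd.Series(list("uvuv")))  # no association
```

The sketch does not guard against a variable with a single category, where the denominator becomes zero.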

Preparing for modelling¶

to_X_y¶

to_X_y.__call__(*, target: str, nominal: str, nominal_max_cardinality: int, nans: str) → pandas.core.frame.DataFrame¶

Change everything to a numeric type

Parameters:	target (str) – Column name of the target variable (for regression or classification)
	nominal (str) – Strategy to deal with nominal columns. One of ‘one-hot’, ‘label’, ‘keep’ or ‘drop’
	nominal_max_cardinality (int)
	nans (str) – Strategy to deal with missing values. One of ‘remove’ or ‘keep’
Raises:	`KeyError` – [description]
	`ValueError` – [description]
Returns:	A dataframe fit for modelling
Return type:	pandas.DataFrame
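
As a rough illustration of what the ‘one-hot’ and ‘remove’ strategies amount to, here is a sketch in plain pandas. The data and column names are made up, and the accessor's actual behaviour and return shape may differ.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "red", None],
    "size": [1.0, 2.0, np.nan, 4.0],
    "label": [0, 1, 0, 1],
})

clean = df.dropna()                       # roughly nans="remove"
y = clean["label"]                        # roughly target="label"
X = pd.get_dummies(clean.drop(columns="label"),
                   columns=["colour"])    # roughly nominal="one-hot"
```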

undersample¶

dataset.undersample(col, replace=False, min_count=None, random_state=None)¶
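
Undersampling usually means reducing every class to the size of the smallest one. A sketch with plain pandas follows, assuming that `min_count=None` defaults to the smallest class size, which is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame({"label": list("aaaaabbc"), "value": range(8)})

# Size of the rarest class; every class is sampled down to this size.
min_count = int(df["label"].value_counts().min())
under = df.groupby("label").sample(n=min_count, random_state=0)
```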

oversample¶

dataset.oversample(col, max_count=None, random_state=None)¶

See https://stackoverflow.com/questions/48373088/duplicating-training-examples-to-handle-class-imbalance-in-a-pandas-data-frame
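
Oversampling duplicates rows of the rarer classes until every class is as large as the largest one. A sketch with plain pandas follows, assuming that `max_count=None` defaults to the largest class size, which is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame({"label": list("aaaaabbc"), "value": range(8)})

# Size of the most frequent class; every class is sampled up to this size.
max_count = int(df["label"].value_counts().max())
over = df.groupby("label").sample(n=max_count, replace=True, random_state=0)
```

Because this sketch samples every class with replacement, rows of the majority class may also repeat; keeping the original rows and adding only the shortfall is an alternative design.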

DataFrame.optimize¶

drop_duplicate_columns¶

optimize.drop_duplicate_columns(inplace: bool = False)¶

Drop duplicate columns that have exactly the same values and datatype

Parameters:	inplace (bool, optional) – Whether to perform inplace operation, by default False
Returns:	Dataframe with no duplicate columns
Return type:	pandas.DataFrame
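
One way to detect such duplicates in plain pandas is `Series.equals`, which requires matching values and matching dtype. The loop below is an illustrative sketch, not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],        # same values and dtype as "a"
    "c": [1.0, 2.0, 3.0],  # same values, different dtype
})

keep = []
for col in df.columns:
    # Series.equals compares values and dtype, and treats NaNs in the
    # same position as equal.
    if not any(df[col].equals(df[kept]) for kept in keep):
        keep.append(col)
out = df[keep]
```

Column b is dropped as a duplicate of a, while c is kept because its dtype differs.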

convert_categories¶

optimize.convert_categories(max_cardinality: int = 20, inplace: bool = False)¶

Convert columns to category whenever possible

Parameters:	max_cardinality (int, optional) – The maximum number of unique values a column may have and still be converted to a category type, by default 20
	inplace (bool, optional) – Whether to perform inplace operation, by default False
Returns:	A transformed dataframe
Return type:	pandas.DataFrame
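
The idea can be sketched in plain pandas as follows. The sketch assumes that only string columns are candidates and that the threshold is inclusive; neither detail is confirmed for `convert_categories`.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Rome"],  # 2 unique values
    "name": ["ann", "bob", "cat", "dan"],      # 4 unique values
})
max_cardinality = 2

out = df.copy()
for col in out.columns:
    is_text = pd.api.types.is_string_dtype(out[col])
    if is_text and out[col].nunique() <= max_cardinality:
        out[col] = out[col].astype("category")
```

Low-cardinality columns such as city typically use less memory as categories, while name stays unchanged because it exceeds the threshold.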

profile¶

optimize.profile(dry_run=True, max_cardinality=20)¶