DataFrame accessors
Basic manipulation
Type casting
cast.__call__(**dtypes) → pandas.core.frame.DataFrame
Convert dtypes of multiple columns using a dictionary
Parameters: dtypes (dict) – Column name to data type mapping
Notes
You can also specify “index” and “datetime” on a column. Note that pandas cannot convert a column containing NaNs to a standard integer type. cast converts such a column to float automatically and notifies the user with a warning. See https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html
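The fallback described in the note can be illustrated in plain pandas. This is only a sketch of the idea, not the library's implementation: `cast_int_or_float` is a hypothetical helper that casts to an integer type when possible and otherwise falls back to float with a warning.

```python
import warnings

import numpy as np
import pandas as pd


def cast_int_or_float(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast to int, or to float (with a warning) if NaNs are present."""
    if series.isna().any():
        warnings.warn(
            f"Column {series.name!r} contains NaNs; converting to float instead of int"
        )
        return series.astype(float)
    return series.astype(int)


complete = cast_int_or_float(pd.Series([1.0, 2.0, 3.0], name="complete"))

# Record warnings so the fallback can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with_gaps = cast_int_or_float(pd.Series([1.0, np.nan, 3.0], name="with_gaps"))

print(complete.dtype)   # an integer dtype
print(with_gaps.dtype)  # float64
```

If integers must be preserved despite missing values, pandas also offers the nullable `"Int64"` extension dtype, which is a different type from the plain `int` discussed above.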
Example
Suppose we have a dataframe
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> import pandas_lightning
>>> df = pd.DataFrame({
...     "X": list("ABACBB"),
...     "Y": list("121092"),
...     "Z": ["hot","warm","hot","cold","cold","hot"]
... })
>>> df
   X  Y     Z
0  A  1   hot
1  B  2  warm
2  A  1   hot
3  C  0  cold
4  B  9  cold
5  B  2   hot
Change the types of the columns by writing
>>> df = df.cast(
...     X="category",               # this will be nominal
...     Y=int,
...     Z=["cold", "warm", "hot"],  # this will be ordinal
... )
which is equivalent to
>>> df["X"] = df["X"].astype("category")
>>> df["Y"] = df["Y"].astype(int)
>>> df["Z"] = df["Z"].astype(CategoricalDtype(
...     ["cold", "warm", "hot"], ordered=True))
Returns: A dataframe whose columns have been converted accordingly
Return type: pandas.DataFrame
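The difference between the nominal and the ordinal column in the example is easiest to see in plain pandas. The sketch below uses only standard pandas calls and does not depend on pandas_lightning; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

z = pd.Series(["hot", "warm", "hot", "cold", "cold", "hot"])

# Nominal: categories carry no order.
nominal = z.astype("category")
# Ordinal: the list order defines cold < warm < hot.
ordinal = z.astype(CategoricalDtype(["cold", "warm", "hot"], ordered=True))

print(nominal.cat.ordered)          # False
print(ordinal.cat.ordered)          # True
print(ordinal.cat.codes.tolist())   # [2, 1, 2, 0, 0, 2]
print((ordinal > "cold").tolist())  # [True, True, True, False, False, True]
print(ordinal.max())                # hot
```

Comparisons and `max()` are only meaningful on the ordered version, which is why the order of the list passed for `Z` matters.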
Column transformations
add_columns.__call__(**lambdas) → pandas.core.frame.DataFrame
Perform multiple lambda operations on series
Parameters:
- lambdas (dict) – A dictionary of column name to lambda mapping
- inplace (bool, optional) – Whether to modify the series inplace, by default False
Examples
>>> import pandas as pd
>>> import pandas_lightning
>>> df = pd.DataFrame({"X": list("ABACBB"),
...                    "Y": list("121092"),
...                    "Z": ["hot","warm","hot","cold","cold","hot"]
... })
>>> df = df.cast(
...     Y=int,
...     Z=["cold", "warm", "hot"])
Example 1
Rewriting to the same column
>>> df = df.add_columns(
...     X=lambda s: s + "rea",
...     Y=lambda s: s+100,
...     Z=lambda s: s.str.upper())
which is the same as
>>> df["X"] = df["X"] + "rea"
>>> df["Y"] = df["Y"] + 100
>>> df["Z"] = df["Z"].str.upper()
Example 2
Writing to another column
>>> df = df.add_columns(
...     X_new=("X", lambda s: s + "rea"),
...     Y_new=("Y", lambda s: s+100),
...     Z_new=("Z", lambda s: s.str.upper())
... )
which is the same as
>>> df["X_new"] = df["X"] + "rea"
>>> df["Y_new"] = df["Y"] + 100
>>> df["Z_new"] = df["Z"].str.upper()
Example 3
Working with more than one column at a time
>>> df = df.add_columns(
...     XY=(["X", "Y"],
...         lambda x, y: x + (y+10).astype(str)),
...     YZ=(["Y", "Z"],
...         lambda y, z: z.astype(str) + "-" + y.astype(str)),
... )
which is the same as
>>> df["XY"] = df["X"] + (df["Y"] + 10).astype(str)
>>> df["YZ"] = df["Z"].astype(str) + "-" + df["Y"].astype(str)
Returns: A transformed copy of the dataframe
Return type: pandas.DataFrame
Raises:
- ValueError – If lambdas is not a dict
- ValueError – [description]
- ValueError – [description]
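The three calling forms shown in the examples (a bare lambda, a `(column, lambda)` pair, and a `(list of columns, lambda)` pair) can be mimicked in plain pandas. The function below is a hypothetical sketch written for illustration, not the library's code; in particular, its handling of errors and of evaluation order is an assumption.

```python
import pandas as pd


def add_columns_sketch(df: pd.DataFrame, **specs) -> pd.DataFrame:
    """Hypothetical stand-in: apply each spec and return a transformed copy."""
    out = df.copy()
    for name, spec in specs.items():
        if callable(spec):
            # Bare lambda: transform the column of the same name.
            out[name] = spec(out[name])
            continue
        source, func = spec
        if isinstance(source, str):
            # (column, lambda): read one column, write to `name`.
            out[name] = func(out[source])
        else:
            # (list of columns, lambda): pass each column as a positional argument.
            out[name] = func(*(out[col] for col in source))
    return out


df = pd.DataFrame({"X": list("AB"), "Y": [1, 2]})
result = add_columns_sketch(
    df,
    Y_new=("Y", lambda s: s + 100),
    XY=(["X", "Y"], lambda x, y: x + y.astype(str)),
)
print(result)
```

Because the sketch works on a copy, the original `df` keeps its two columns.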
drop_columns_with_rules
lambdas.drop_columns_with_rules(*functions)
Drop a column if any of the conditions defined in the functions or lambdas are met
Parameters: functions (functions or lambdas) – Functions or lambdas that take a series as a parameter and return a bool
Examples
>>> import pandas as pd
>>> import pandas_lightning
>>> import numpy as np
>>> df = pd.DataFrame({"X": [np.nan, np.nan, np.nan, np.nan, "hey"],
...                    "Y": [0, np.nan, 0, 0, 1],
...                    "Z": [1, 9, 5, 4, 2]})
One of the more common patterns is dropping a column in which the share of missing values or zeros exceeds a certain threshold.
>>> df.lambdas.drop_columns_with_rules(
...     lambda s: s.pctg.nans > 0.75,
...     lambda s: s.pctg.zeros > 0.5)
   Z
0  1
1  9
2  5
3  4
4  2
Returns: Dataframe with dropped columns
Return type: pandas.DataFrame
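For comparison, the same rule-based dropping can be sketched with plain pandas. `drop_columns_where` is a hypothetical helper, and `isna().mean()` and `(s == 0).mean()` are stand-ins assumed to correspond to the `pctg.nans` and `pctg.zeros` accessors used above.

```python
import numpy as np
import pandas as pd


def drop_columns_where(df: pd.DataFrame, *rules) -> pd.DataFrame:
    """Hypothetical helper: drop every column for which any rule returns True."""
    keep = [col for col in df.columns if not any(rule(df[col]) for rule in rules)]
    return df[keep]


df = pd.DataFrame({"X": [np.nan, np.nan, np.nan, np.nan, "hey"],
                   "Y": [0, np.nan, 0, 0, 1],
                   "Z": [1, 9, 5, 4, 2]})

result = drop_columns_where(
    df,
    lambda s: s.isna().mean() > 0.75,   # share of missing values
    lambda s: (s == 0).mean() > 0.5,    # share of zeros
)
print(result.columns.tolist())  # ['Z']
```

Here `X` is dropped because 80% of its values are missing, and `Y` because 60% of its values are zero.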
fillna
lambdas.fillna(**d)
Example
>>> df.lambdas.fillna(
...     Sex=lambda sex: sex.median(),
...     Age=(["Sex", "Pclass"], lambda group: group.median())
... )
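The tuple form in the example is not described further here. One plausible reading, stated as an assumption, is that missing values in `Age` are filled with a statistic computed within groups of `Sex` and `Pclass`. In plain pandas that idea could look like the sketch below, where the data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sex": ["m", "m", "f", "f", "m"],
    "Pclass": [1, 1, 2, 2, 1],
    "Age": [20.0, np.nan, 50.0, np.nan, 40.0],
})

# Median age per (Sex, Pclass) group, aligned to the original rows.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; existing values are left untouched.
filled = df.assign(Age=df["Age"].fillna(group_median))
print(filled["Age"].tolist())  # [20.0, 30.0, 50.0, 50.0, 40.0]
```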
setna
lambdas.setna(**conditions)
You would do this on an existing column
Returns: A copy of the dataframe
Return type: pandas.DataFrame
Raises: ValueError – If wrong type is specified
map_conditional
lambdas.map_conditional(**mappings)
Map values from multiple columns based on conditional statements expressed as lambdas. Similar to numpy.select and numpy.where.
You would use this for if-elif-else statements. If your statement is only an if-else and it is short, you would want to use sapply instead.
Similar to pandas.Series.map.
Parameters:
- lambdas (dict) – A nested dictionary where the key is the column names and the value is a dictionary of (newvalue: conditional statement) where the conditional statement is expressed as a lambda statement
- inplace (bool, optional) – Whether to modify the series inplace, by default False
Example
>>> df = pd.DataFrame({"X": [0, 5, 3, 3, 4, 1],
...                    "Y": [1, 2, 1, 0, 9, 2],
...                    "Z": ["hot","warm","hot","cold","cold","hot"]
... })
Here is an example that maps the values across two series and creates a new series W.
>>> df.lambdas.map_conditional(
...     W=(["X", "Y"], {
...         "green": lambda x, y: x + y > 1,
...         "orange": lambda x, y: x == 5,
...     }, "black")
... )
   X  Y     Z       W
0  0  1   hot   black
1  5  2  warm  orange
2  3  1   hot   green
3  3  0  cold   green
4  4  9  cold  orange
5  1  2   hot   black
Here is an example that changes the value of Z. Note that this is a contrived example and you would want to use the pandas.Series.map API instead.
>>> df.lambdas.map_conditional(
...     Z={"blue": lambda z: z == "cold",
...        "amber": lambda z: z == "warm",
...        "red": lambda z: z == "hot"})
Returns: A transformed copy of the dataframe
Return type: pandas.DataFrame
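Since the method is described as similar to numpy.select, a short numpy.select example may help as a reference point for the if-elif-else pattern. It uses only NumPy and pandas. Note that numpy.select picks the first condition that matches, and whether map_conditional resolves overlapping conditions the same way is not stated here. The column name `level` and the thresholds are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [0, 5, 3, 3, 4, 1]})

# Conditions are checked in order; the first one that holds wins.
conditions = [df["X"] >= 4, df["X"] >= 2]
choices = ["high", "mid"]

# Rows matching no condition receive the default value.
df["level"] = np.select(conditions, choices, default="low")
print(df["level"].tolist())  # ['low', 'high', 'mid', 'mid', 'high', 'low']
```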
Charts
options
quickplot.options()
Call this method to see which graphs you can plot
>>> df.quickplot(numerical=["age"], categorical=["year"]).options
distplot
Tests
Preparing for modelling
to_X_y
to_X_y.__call__(*, target: str, nominal: str, nominal_max_cardinality: int, nans: str) → pandas.core.frame.DataFrame
Change everything to a numeric type
Parameters:
- target (str) – Column name of target variable (for regression or classification)
- nominal (str) – Strategy to deal with nominal columns. One of ‘one-hot’, ‘label’, ‘keep’ or ‘drop’
- nominal_max_cardinality (int)
- nans (str) – Strategy to deal with missing values. One of ‘remove’ or ‘keep’.
Raises:
- KeyError – [description]
- ValueError – [description]
Returns: A dataframe fit for modelling
Return type: pandas.DataFrame
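To make the strategies concrete, here is a rough plain-pandas sketch of what a ‘one-hot’ nominal strategy combined with a ‘remove’ NaN strategy might amount to. This is an assumption about the intent rather than a description of the accessor's behaviour, and the column names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "size": [1.0, 2.0, 3.0, np.nan],
    "price": [10, 20, 30, 40],
})
target = "price"

# 'remove': drop every row that has a missing value.
complete = df.dropna()

# 'one-hot': expand the nominal column into one indicator column per category.
X = pd.get_dummies(complete.drop(columns=target), columns=["color"], dtype=int)
y = complete[target]

print(X.columns.tolist())  # ['size', 'color_blue', 'color_red']
print(y.tolist())          # [10, 20]
```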
DataFrame.optimize
drop_duplicate_columns
optimize.drop_duplicate_columns(inplace: bool = False)
Drop duplicate columns that have exactly the same values and datatype
Parameters: inplace (bool, optional) – Whether to perform inplace operation, by default False
Returns: Dataframe with no duplicate columns
Return type: pandas.DataFrame
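The "same values and datatype" criterion can be sketched in plain pandas with `Series.equals`, which treats columns of different dtypes as unequal. `drop_duplicate_columns_sketch` is a hypothetical helper for illustration, and keeping the first occurrence of each duplicate is an assumption.

```python
import pandas as pd


def drop_duplicate_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the first of any columns equal in values and dtype."""
    keep = []
    for col in df.columns:
        # Series.equals compares values and dtype, ignoring the column name.
        if not any(df[col].equals(df[kept]) for kept in keep):
            keep.append(col)
    return df[keep]


df = pd.DataFrame({
    "a": [1, 2],
    "b": [1, 2],      # same values and dtype as "a": dropped
    "c": [1.0, 2.0],  # same values but float dtype: kept
    "d": [3, 4],
})
print(drop_duplicate_columns_sketch(df).columns.tolist())  # ['a', 'c', 'd']
```

The pairwise comparison is quadratic in the number of columns, which is acceptable for a sketch but may be slow on very wide frames.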
convert_categories
optimize.convert_categories(max_cardinality: int = 20, inplace: bool = False)
Convert columns to category whenever possible
Parameters:
- max_cardinality (int, optional) – The maximum number of unique values a column may have and still be converted to a category type, by default 20
- inplace (bool, optional) – Whether to perform inplace operation, by default False
Returns: A transformed dataframe
Return type: pandas.DataFrame
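A cardinality-based conversion of this kind can be approximated in plain pandas as follows. The helper name and the choice to consider only non-numeric columns are assumptions made for this sketch; the accessor may select columns differently.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def convert_categories_sketch(df: pd.DataFrame, max_cardinality: int = 20) -> pd.DataFrame:
    """Hypothetical helper: convert low-cardinality non-numeric columns to category."""
    out = df.copy()
    for col in out.columns:
        if not is_numeric_dtype(out[col]) and out[col].nunique() <= max_cardinality:
            out[col] = out[col].astype("category")
    return out


df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Rome"],  # 2 unique values: converted
    "ticket": ["a1", "b2", "c3", "d4"],        # 4 unique values: left as-is
    "fare": [7.5, 8.0, 7.5, 9.0],              # numeric: left as-is
})
result = convert_categories_sketch(df, max_cardinality=2)
print(result.dtypes)
```

Converting repeated strings to a category type typically reduces memory use, which is presumably why the method sits under the optimize accessor.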