Bases: ColumnTransformer
A transformer to one-hot encode categorical features via sklearn's `OneHotEncoder`.
It wraps the `fit_transform` and `inverse_transform` methods of `OneHotEncoder` so that they comply with the `ColumnTransformer` interface.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `drop` | `Optional[Union[list, str]]` | A str or list of str to pass to `OneHotEncoder`'s `drop` parameter. | `None` |
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| `missing_value` | `Any` | The value used to fill missing values in the data. |
After applying the transformer, the following attributes will be populated:
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| `original_column_name` | | The name of the original column. |
| `new_column_names` | | The names of the columns generated by the transformer. |
Source code in src/nhssynth/modules/dataloader/transformers/categorical.py
class OHECategoricalTransformer(ColumnTransformer):
    """
    A transformer to one-hot encode categorical features via sklearn's `OneHotEncoder`.
    Essentially wraps the `fit_transform` and `inverse_transform` methods of `OneHotEncoder` to comply with the `ColumnTransformer` interface.

    Args:
        drop: str or list of str, to pass to `OneHotEncoder`'s `drop` parameter.

    Attributes:
        missing_value: The value used to fill missing values in the data.

    After applying the transformer, the following attributes will be populated:

    Attributes:
        original_column_name: The name of the original column.
        new_column_names: The names of the columns generated by the transformer.
    """

    def __init__(self, drop: Optional[Union[list, str]] = None) -> None:
        super().__init__()
        self._drop: Union[list, str] = drop
        self._transformer: OneHotEncoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop=self._drop)
        self.missing_value: Any = None

    def apply(self, data: pd.Series, missing_value: Optional[Any] = None) -> pd.DataFrame:
        """
        Apply the transformer to the data via sklearn's `OneHotEncoder`'s `fit_transform` method. Name the new columns via manipulation of the original column name.
        If `missing_value` is provided, fill missing values with this value before applying the transformer to ensure a new category is added.

        Args:
            data: The column of data to transform.
            missing_value: The value learned by the `MetaTransformer` to represent missingness, this is only used as part of the `AugmentMissingnessStrategy`.
        """
        self.original_column_name = data.name
        if missing_value:
            data = data.fillna(missing_value)
            self.missing_value = missing_value
        transformed_data = pd.DataFrame(
            self._transformer.fit_transform(data.values.reshape(-1, 1)),
            columns=self._transformer.get_feature_names_out(input_features=[data.name]),
        )
        self.new_column_names = transformed_data.columns
        return transformed_data

    def revert(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Revert data to pre-transformer state via sklearn's `OneHotEncoder`'s `inverse_transform` method.
        If `missing_value` is provided, replace instances of this value in the data with `np.nan` to ensure missing values are represented correctly in the case
        where `missing_value` was 'modelled' and thus generated.

        Args:
            data: The full dataset including the column(s) to be reverted to their pre-transformer state.

        Returns:
            The dataset with a single categorical column that is analogous to the original column, with the same name, and without the generated one-hot columns.
        """
        data[self.original_column_name] = pd.Series(
            self._transformer.inverse_transform(data[self.new_column_names].values).flatten(),
            index=data.index,
            name=self.original_column_name,
        )
        if self.missing_value:
            data[self.original_column_name] = data[self.original_column_name].replace(self.missing_value, np.nan)
        return data.drop(self.new_column_names, axis=1)
`apply(data, missing_value=None)`
Apply the transformer to the data via the `fit_transform` method of sklearn's `OneHotEncoder`. The new columns are named from the original column name via the encoder's `get_feature_names_out` method.
If `missing_value` is provided, missing values are filled with it before the transformer is applied, so that missingness is encoded as a category of its own.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `Series` | The column of data to transform. | required |
| `missing_value` | `Optional[Any]` | The value learned by the `MetaTransformer` to represent missingness. This is only used as part of the `AugmentMissingnessStrategy`. | `None` |
`revert(data)`
Revert data to its pre-transformer state via the `inverse_transform` method of sklearn's `OneHotEncoder`.
If `missing_value` was provided when the transformer was applied, instances of this value in the data are replaced with `np.nan`, so that missing values are represented correctly in the case where `missing_value` was 'modelled' and thus generated.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | The full dataset including the column(s) to be reverted to their pre-transformer state. | required |
Returns:
| Type | Description |
| --- | --- |
| `DataFrame` | The dataset with a single categorical column that is analogous to the original column, with the same name, and without the generated one-hot columns. |