categorical

OHECategoricalTransformer

Bases: ColumnTransformer

A transformer to one-hot encode categorical features via sklearn's OneHotEncoder. Essentially wraps the fit_transform and inverse_transform methods of OneHotEncoder to comply with the ColumnTransformer interface.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| drop | Optional[Union[list, str]] | str or list of str, to pass to OneHotEncoder's drop parameter. | None |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| missing_value | Any | The value used to fill missing values in the data. |

After applying the transformer, the following attributes will be populated:

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| original_column_name | | The name of the original column. |
| new_column_names | | The names of the columns generated by the transformer. |

Source code in src/nhssynth/modules/dataloader/transformers/categorical.py
class OHECategoricalTransformer(ColumnTransformer):
    """
    A transformer to one-hot encode categorical features via sklearn's `OneHotEncoder`.
    Essentially wraps the `fit_transform` and `inverse_transform` methods of `OneHotEncoder` to comply with the `ColumnTransformer` interface.

    Args:
        drop: str or list of str, to pass to `OneHotEncoder`'s `drop` parameter.

    Attributes:
        missing_value: The value used to fill missing values in the data.

    After applying the transformer, the following attributes will be populated:

    Attributes:
        original_column_name: The name of the original column.
        new_column_names: The names of the columns generated by the transformer.
    """

    def __init__(self, drop: Optional[Union[list, str]] = None) -> None:
        super().__init__()
        self._drop: Optional[Union[list, str]] = drop
        self._transformer: OneHotEncoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop=self._drop)
        self.missing_value: Any = None

    def apply(self, data: pd.Series, missing_value: Optional[Any] = None) -> pd.DataFrame:
        """
        Apply the transformer to the data via the `fit_transform` method of sklearn's `OneHotEncoder`. The new columns are named
        `<original column name>_<category>` by the encoder's `get_feature_names_out` method.
        If `missing_value` is provided, fill missing values with it before applying the transformer, so that missingness is encoded as its own category.

        Args:
            data: The column of data to transform.
            missing_value: The value learned by the `MetaTransformer` to represent missingness; this is only used as part of the `AugmentMissingnessStrategy`.
        """
        self.original_column_name = data.name
        # Compare against None so that falsy sentinels such as 0 or "" are still honoured.
        if missing_value is not None:
            data = data.fillna(missing_value)
            self.missing_value = missing_value
        transformed_data = pd.DataFrame(
            self._transformer.fit_transform(data.values.reshape(-1, 1)),
            columns=self._transformer.get_feature_names_out(input_features=[data.name]),
        )
        self.new_column_names = transformed_data.columns
        return transformed_data

    def revert(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Revert data to pre-transformer state via sklearn's `OneHotEncoder`'s `inverse_transform` method.
        If a `missing_value` was recorded by `apply`, replace instances of this value in the reverted column with `np.nan`, so that missing values are
        represented correctly in the case where `missing_value` was 'modelled' and thus generated.

        Args:
            data: The full dataset including the column(s) to be reverted to their pre-transformer state.

        Returns:
            The dataset with a single categorical column that is analogous to the original column, with the same name, and without the generated one-hot columns.
        """
        data[self.original_column_name] = pd.Series(
            self._transformer.inverse_transform(data[self.new_column_names].values).flatten(),
            index=data.index,
            name=self.original_column_name,
        )
        # Compare against None so that falsy sentinels such as 0 or "" are still reverted.
        if self.missing_value is not None:
            data[self.original_column_name] = data[self.original_column_name].replace(self.missing_value, np.nan)
        return data.drop(self.new_column_names, axis=1)

apply(data, missing_value=None)

Apply the transformer to the data via the fit_transform method of sklearn's OneHotEncoder. The new columns are named <original column name>_<category> by the encoder's get_feature_names_out method. If missing_value is provided, fill missing values with it before applying the transformer, so that missingness is encoded as its own category.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | Series | The column of data to transform. | required |
| missing_value | Optional[Any] | The value learned by the MetaTransformer to represent missingness; this is only used as part of the AugmentMissingnessStrategy. | None |

Source code in src/nhssynth/modules/dataloader/transformers/categorical.py
def apply(self, data: pd.Series, missing_value: Optional[Any] = None) -> pd.DataFrame:
    """
    Apply the transformer to the data via the `fit_transform` method of sklearn's `OneHotEncoder`. The new columns are named
    `<original column name>_<category>` by the encoder's `get_feature_names_out` method.
    If `missing_value` is provided, fill missing values with it before applying the transformer, so that missingness is encoded as its own category.

    Args:
        data: The column of data to transform.
        missing_value: The value learned by the `MetaTransformer` to represent missingness; this is only used as part of the `AugmentMissingnessStrategy`.
    """
    self.original_column_name = data.name
    # Compare against None so that falsy sentinels such as 0 or "" are still honoured.
    if missing_value is not None:
        data = data.fillna(missing_value)
        self.missing_value = missing_value
    transformed_data = pd.DataFrame(
        self._transformer.fit_transform(data.values.reshape(-1, 1)),
        columns=self._transformer.get_feature_names_out(input_features=[data.name]),
    )
    self.new_column_names = transformed_data.columns
    return transformed_data

revert(data)

Revert data to pre-transformer state via the inverse_transform method of sklearn's OneHotEncoder. If a missing_value was recorded by apply, replace instances of this value in the reverted column with np.nan, so that missing values are represented correctly in the case where missing_value was 'modelled' and thus generated.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | The full dataset including the column(s) to be reverted to their pre-transformer state. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The dataset with a single categorical column that is analogous to the original column, with the same name, and without the generated one-hot columns. |

Source code in src/nhssynth/modules/dataloader/transformers/categorical.py
def revert(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Revert data to pre-transformer state via sklearn's `OneHotEncoder`'s `inverse_transform` method.
    If a `missing_value` was recorded by `apply`, replace instances of this value in the reverted column with `np.nan`, so that missing values are
    represented correctly in the case where `missing_value` was 'modelled' and thus generated.

    Args:
        data: The full dataset including the column(s) to be reverted to their pre-transformer state.

    Returns:
        The dataset with a single categorical column that is analogous to the original column, with the same name, and without the generated one-hot columns.
    """
    data[self.original_column_name] = pd.Series(
        self._transformer.inverse_transform(data[self.new_column_names].values).flatten(),
        index=data.index,
        name=self.original_column_name,
    )
    # Compare against None so that falsy sentinels such as 0 or "" are still reverted.
    if self.missing_value is not None:
        data[self.original_column_name] = data[self.original_column_name].replace(self.missing_value, np.nan)
    return data.drop(self.new_column_names, axis=1)