continuous
ClusterContinuousTransformer
Bases: ColumnTransformer
A transformer to cluster continuous features via sklearn's BayesianGaussianMixture.
It wraps the process of fitting the BGM model and generating cluster assignments and normalised values for the data, so that this process complies with the ColumnTransformer interface.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_components | int | The number of components to use in the BGM model. | 10 |
n_init | int | The number of initialisations to use in the BGM model. | 1 |
init_params | str | The initialisation method to use in the BGM model. | 'kmeans' |
random_state | int | The random state to use in the BGM model. | 0 |
max_iter | int | The maximum number of iterations to use in the BGM model. | 1000 |
remove_unused_components | bool | Whether to remove components that have no data assigned (EXPERIMENTAL). | False |
clip_output | bool | Whether to clip the output normalised values to the range [-1, 1]. | False |
After applying the transformer, the following attributes will be populated:
Attributes:
Name | Type | Description |
---|---|---|
means | | The means of the components in the BGM model. |
stds | | The standard deviations of the components in the BGM model. |
new_column_names | | The names of the columns generated by the transformer (one for the normalised values and one for each cluster component). |
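As a rough illustration of how the means, stds and new_column_names attributes could arise, the sketch below fits sklearn's BayesianGaussianMixture directly on a toy column. It is not the class's actual source: the column name `age`, the `_normalised` / `_c1` naming pattern and the `STD_MULTIPLIER` divisor are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import BayesianGaussianMixture

# Toy bimodal column; the name "age" is purely illustrative.
rng = np.random.default_rng(0)
column = pd.Series(
    np.concatenate([rng.normal(20.0, 2.0, 200), rng.normal(60.0, 5.0, 200)]),
    name="age",
)

n_components = 3
bgm = BayesianGaussianMixture(
    n_components=n_components,
    n_init=1,
    init_params="kmeans",
    random_state=0,
    max_iter=1000,
)
values = column.to_numpy().reshape(-1, 1)
bgm.fit(values)

# Per-component means and standard deviations of the fitted mixture.
means = bgm.means_.reshape(-1)
stds = np.sqrt(bgm.covariances_).reshape(-1)

# Hard-assign each value to its most probable component.
assigned = bgm.predict_proba(values).argmax(axis=1)

# Normalise each value relative to its assigned component.
# ASSUMPTION: dividing by 4 standard deviations is illustrative only.
STD_MULTIPLIER = 4
normalised = (values.reshape(-1) - means[assigned]) / (STD_MULTIPLIER * stds[assigned])

# One normalised column plus one indicator column per component.
# ASSUMPTION: this naming pattern is hypothetical.
one_hot = np.eye(n_components, dtype=int)[assigned]
new_column_names = [f"{column.name}_normalised"] + [
    f"{column.name}_c{i + 1}" for i in range(n_components)
]
transformed = pd.DataFrame(
    np.column_stack([normalised, one_hot]),
    index=column.index,
    columns=new_column_names,
)
```

Each row of `transformed` then carries one normalised value and a one-hot indicator of the component it was assigned to.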
Source code in src/nhssynth/modules/dataloader/transformers/continuous.py
apply(data, missingness_column=None)
Apply the transformer to the data via the fit and predict_proba methods of sklearn's BayesianGaussianMixture. The new columns are named after the original column.
If missingness_column is provided, it is used to extract the non-missing data; the missing values are assigned to a new pseudo-cluster with mean 0 (i.e. all values in the normalised column are 0.0). This is done by taking the full index before subsetting to the non-missing data, then reindexing.
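The missingness handling described above can be sketched with plain pandas. This is only an illustration under stated assumptions: the BGM step is replaced by a simple stand-in, and the column names `x_normalised`, `x_c1` and `x_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.5, np.nan, 3.0, np.nan, 2.0], name="x")
missingness = data.isna().astype(int)  # 1 where the value is missing

# Remember the full index before subsetting to the non-missing rows.
full_index = data.index
observed = data[missingness == 0]

# Stand-in for the BGM step: one normalised column and one component indicator.
transformed = pd.DataFrame(
    {"x_normalised": (observed - observed.mean()) / observed.std(), "x_c1": 1},
    index=observed.index,
)

# Reindex to the full index; rows that were missing become all zeros.
transformed = transformed.reindex(full_index).fillna(0.0)

# The missingness indicator then plays the role of the pseudo-cluster column.
result = pd.concat([transformed, missingness.rename("x_missing")], axis=1)
```

Rows 1 and 3 of `result` have a normalised value of 0.0, no real component set, and `x_missing` equal to 1.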
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Series | The column of data to transform. | required |
missingness_column | Optional[Series] | The column of data representing missingness, this is only used as part of the | None |
Returns:
Type | Description |
---|---|
DataFrame | The transformed data (will be multiple columns if |
revert(data)
Revert data to pre-transformer state via the means and stds of the BGM. Extract the relevant columns from the data via the new_column_names attribute.
If missingness_column was provided to the apply method, drop the missing values from the data before reverting and use the full_index to reintroduce missing values when original_column_name is constructed.
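A minimal sketch of the reverting step, assuming the normalised value was produced as (value - mean) / (STD_MULTIPLIER * std) for the assigned component. The multiplier, the column names and the hard-coded means and stds are all assumptions for illustration, not values taken from the library.

```python
import numpy as np
import pandas as pd

# Hypothetical fitted attributes for a two-component model.
means = np.array([20.0, 60.0])
stds = np.array([2.0, 5.0])
STD_MULTIPLIER = 4  # ASSUMPTION: must match whatever was used when normalising

new_column_names = ["age_normalised", "age_c1", "age_c2"]
generated = pd.DataFrame(
    {
        "age_normalised": [0.25, -0.5, 0.0],
        "age_c1": [1, 0, 0],
        "age_c2": [0, 1, 1],
        "other": ["a", "b", "c"],
    }
)

# Pick out the generated columns, then find each row's component.
block = generated[new_column_names]
assigned = block.iloc[:, 1:].to_numpy().argmax(axis=1)

# Undo the normalisation using the assigned component's mean and std.
reverted = (
    block["age_normalised"].to_numpy() * STD_MULTIPLIER * stds[assigned]
    + means[assigned]
)

# Replace the generated columns with a single column under the original name.
result = generated.drop(columns=new_column_names).assign(age=reverted)
# result["age"] is [22.0, 50.0, 60.0]
```

The output keeps the unrelated `other` column and carries one `age` column in place of the three generated ones.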
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | The full dataset including the column(s) to be reverted to their pre-transformer state. | required |
Returns:
Type | Description |
---|---|
DataFrame | The dataset with a single continuous column that is analogous to the original column, with the same name, and without the generated columns from which it is derived. |