clean_data
Functions for cleaning and cleansing datasets
batch_normalise_column_names(datasets)
Normalise the column names for all datasets in the provided dictionary
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`datasets` | `Dict[str, Dict[str, Any]]` | A dictionary containing dataset names and their corresponding dataframes | required |
Returns:
Type | Description |
---|---|
`Dict[str, Dict[str, Any]]` | The dictionary containing the normalised dataframes |
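As a rough illustration of the batch step, the sketch below renames the columns of every dataframe in the dictionary. Two things are assumptions, not taken from the source: the dataframe is stored under a hypothetical `"data"` key in each inner dictionary, and "normalise" is taken to mean trimming, lower-casing, and replacing runs of non-alphanumeric characters with underscores.

```python
import re
from typing import Any, Dict

import pandas as pd


def normalise_name(name: str) -> str:
    """Assumed rule: trim, lower-case, collapse non-alphanumerics to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def batch_normalise_sketch(datasets: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    # "data" is a hypothetical key for the dataframe inside each entry.
    for entry in datasets.values():
        entry["data"] = entry["data"].rename(columns=normalise_name)
    return datasets


datasets = {
    "master": {
        "data": pd.DataFrame({"Activity Year": [2425], " NHS-England Region ": ["London"]})
    }
}
normalised = batch_normalise_sketch(datasets)
print(list(normalised["master"]["data"].columns))
```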
Source code in devices_rap/clean_data.py
cleanse_master_data(master_df)
Clean the master dataset ready for processing. This function will:
- Convert high level device type values
- Convert activity year values without century
- Convert activity date values to datetime
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`master_df` | `DataFrame` | The master dataset to be cleaned. Must contain the following columns: `der_high_level_device_type`, `cln_activity_year` | required |
Returns:
Type | Description |
---|---|
`DataFrame` | The cleaned master dataset |
Raises:
Type | Description |
---|---|
`ColumnsNotFoundError` | If the required columns are not present in the dataset |
Source code in devices_rap/clean_data.py
cleanse_master_joined_dataset(master_joined_df)
Clean the joined dataset ready for pivoting. This function will:
- Consolidate region columns into a single column, 'upd_region'
- Fix inconsistent 'upd_region' values by replacing '&' with 'and'
- Fill missing 'rag_status' values with 'RED' where 'upd_high_level_device_type' is missing
- Fill missing values with 'NULL' in the columns:
    - rag_status
    - upd_high_level_device
    - cln_manufacturer
    - cln_manufacturer_device_name
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`master_joined_df` | `DataFrame` | The joined dataset to be cleaned. Must contain the following columns: `region`, `nhs_england_region`, `rag_status`, `upd_high_level_device_type`, `cln_manufacturer`, `cln_manufacturer_device_name` | required |
Returns:
Type | Description |
---|---|
`DataFrame` | The cleaned joined dataset |
Raises:
Type | Description |
---|---|
`ColumnsNotFoundError` | If the required columns are not present in the dataset |
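The cleaning steps listed for `cleanse_master_joined_dataset` can be sketched with plain pandas as follows. This is an illustration under assumptions, not the project's implementation: it assumes `nhs_england_region` takes precedence over `region` when consolidating, that the original region columns are kept, and that the `upd_high_level_device` entry in the fill list refers to the `upd_high_level_device_type` column.

```python
import pandas as pd

FILL_NULL_COLUMNS = [
    "rag_status",
    "upd_high_level_device_type",  # assumed to be the column meant by the list above
    "cln_manufacturer",
    "cln_manufacturer_device_name",
]


def cleanse_joined_sketch(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Assumption: prefer 'nhs_england_region', fall back to 'region'.
    df["upd_region"] = df["nhs_england_region"].fillna(df["region"])
    df["upd_region"] = df["upd_region"].str.replace("&", "and", regex=False)
    # RED only where the status is missing and the device type is missing too.
    needs_red = df["rag_status"].isna() & df["upd_high_level_device_type"].isna()
    df.loc[needs_red, "rag_status"] = "RED"
    # Any remaining gaps in these columns become the literal string 'NULL'.
    df[FILL_NULL_COLUMNS] = df[FILL_NULL_COLUMNS].fillna("NULL")
    return df


joined = pd.DataFrame(
    {
        "region": [None, "North East & Yorkshire"],
        "nhs_england_region": ["South West", None],
        "rag_status": ["GREEN", None],
        "upd_high_level_device_type": ["PACEMAKER", None],
        "cln_manufacturer": ["Acme", None],
        "cln_manufacturer_device_name": [None, None],
    }
)
cleaned = cleanse_joined_sketch(joined)
print(cleaned[["upd_region", "rag_status", "cln_manufacturer"]])
```

The order matters here: the 'RED' fill has to run before the 'NULL' fill, otherwise the missing `rag_status` values would already have been replaced.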
Source code in devices_rap/clean_data.py
cleanse_exceptions(exceptions_df, rag_priorities=None)
Clean the exceptions dataset ready for processing.
First, it converts the handover date columns to datetime format. Then it removes duplicate exceptions, keeping for each provider and device code combination the first occurrence with the highest-priority RAG status, as defined by the rag_priorities variable.
The rag_priorities variable is a list of RAG status priorities, with the default being:
- "AMBER"
- "RED"
- "YELLOW"
If the dataset contains additional RAG statuses, they will be added to the end of the list in alphabetical order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`exceptions_df` | `DataFrame` | The exceptions dataset to be cleaned. Must contain the following columns: | required |
`rag_priorities` | `List[str]` | The list of RAG status priorities, by default RAG_PRIORITIES | `None` |
Returns:
Type | Description |
---|---|
`DataFrame` | The cleaned exceptions dataset |
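The rule for extending the priority list (statuses not in the list are appended in alphabetical order) might look like the sketch below. The helper name `extend_priorities` is hypothetical, and the `RAG_PRIORITIES` value simply copies the default order documented above.

```python
from typing import List, Optional

import pandas as pd

RAG_PRIORITIES = ["AMBER", "RED", "YELLOW"]  # default order as documented above


def extend_priorities(statuses: pd.Series, rag_priorities: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: append unseen statuses alphabetically."""
    priorities = list(RAG_PRIORITIES if rag_priorities is None else rag_priorities)
    extras = sorted(set(statuses.dropna()) - set(priorities))
    return priorities + extras


statuses = pd.Series(["RED", "GREEN", "AMBER", "BLUE", None])
print(extend_priorities(statuses))
```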
Source code in devices_rap/clean_data.py
cleanse_device_taxonomy(device_taxonomy)
Cleanses the device taxonomy DataFrame by converting 'Y'/'N' string values in specific columns to boolean values.
This function processes the 'migrated_categories' and 'non_migrated_categories' columns, converting their values to True for 'Y', False for 'N', and None for any other value. The results are stored in new columns: 'upd_migrated_categories' and 'upd_non_migrated_categories'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`device_taxonomy` | `DataFrame` | The DataFrame containing device taxonomy data with 'migrated_categories' and 'non_migrated_categories' columns. | required |
Returns:
Type | Description |
---|---|
`DataFrame` | The updated DataFrame with new columns for migrated and non-migrated categories containing boolean values. |
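A minimal sketch of the 'Y'/'N' conversion described above, using a plain dictionary lookup. The function name is hypothetical and this is not necessarily how the project implements it.

```python
import pandas as pd

YES_NO = {"Y": True, "N": False}


def cleanse_taxonomy_sketch(device_taxonomy: pd.DataFrame) -> pd.DataFrame:
    out = device_taxonomy.copy()
    for column in ("migrated_categories", "non_migrated_categories"):
        # dict.get returns None for anything other than 'Y' or 'N'.
        out[f"upd_{column}"] = [YES_NO.get(value) for value in out[column]]
    return out


taxonomy = pd.DataFrame(
    {
        "migrated_categories": ["Y", "N", "maybe"],
        "non_migrated_categories": ["N", "Y", None],
    }
)
converted_taxonomy = cleanse_taxonomy_sketch(taxonomy)
print(converted_taxonomy)
```

The original columns are left untouched, so the raw 'Y'/'N' values remain available alongside the new `upd_` columns.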
Source code in devices_rap/clean_data.py
convert_date_columns_to_datetime(data, date_columns)
Convert the specified date columns in the dataframe to datetime format using the parse_dates function. Raises ColumnsNotFoundError if any of the specified date columns are not present in the dataframe, and logs the conversion of each date column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data` | `DataFrame` | The dataframe containing the date columns to be converted. | required |
`date_columns` | `List[str]` | A list of column names to be converted to datetime. | required |
Returns:
Type | Description |
---|---|
`DataFrame` | The dataframe with specified date columns converted to datetime. |
Raises:
Type | Description |
---|---|
`ColumnsNotFoundError` | If any of the specified date columns are not present in the dataframe. |
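The check-then-convert behaviour can be approximated as below. This is a sketch under assumptions: `pd.to_datetime` stands in for the project's `parse_dates` function, the `ColumnsNotFoundError` class here is a local stand-in for the project's exception, and `handover_date` is a hypothetical column name.

```python
import logging
from typing import List

import pandas as pd

logger = logging.getLogger(__name__)


class ColumnsNotFoundError(KeyError):
    """Local stand-in for the project's exception of the same name."""


def convert_dates_sketch(data: pd.DataFrame, date_columns: List[str]) -> pd.DataFrame:
    # Fail before converting anything if a requested column is absent.
    missing = [column for column in date_columns if column not in data.columns]
    if missing:
        raise ColumnsNotFoundError(f"Missing date columns: {missing}")
    data = data.copy()
    for column in date_columns:
        logger.info("Converting %s to datetime", column)
        # pd.to_datetime stands in for the project's parse_dates helper.
        data[column] = pd.to_datetime(data[column])
    return data


exceptions = pd.DataFrame({"handover_date": ["2024-04-01", "2024-05-15"]})
converted = convert_dates_sketch(exceptions, ["handover_date"])
print(converted.dtypes)
```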
Source code in devices_rap/clean_data.py
drop_duplicates_on_priority(data, subset, priority_column, priority_order)
Remove duplicate rows from the dataset. For each unique combination of values in the subset column(s), the function keeps the first occurrence with the highest-priority value in the priority_column, as defined by the priority_order variable.
If the priority_column contains values that are not listed in priority_order, they are added to the end of the list in alphabetical order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data` | `DataFrame` | The dataset with duplicates to be removed. Must contain the columns specified in the subset and priority_column variables. | required |
`subset` | `str \| List[str]` | The column(s) to use to identify duplicates | required |
`priority_column` | `str` | The column to use to determine the priority of the duplicates | required |
`priority_order` | `List[str]` | The list of priority values to use when determining which duplicates to keep | required |
Returns:
Type | Description |
---|---|
`DataFrame` | The dataset with duplicates removed |
Raises:
Type | Description |
---|---|
`ColumnsNotFoundError` | If the required columns are not present in the dataset |
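One way to get priority-aware de-duplication in pandas is to rank the priority values, sort on the rank with a stable sort, and then keep the first row per subset. The sketch below takes that approach; it is not confirmed to be the project's method, the column names in the example are hypothetical, and restoring the original row order at the end is a choice made here for readability.

```python
from typing import List, Union

import pandas as pd


def drop_duplicates_on_priority_sketch(
    data: pd.DataFrame,
    subset: Union[str, List[str]],
    priority_column: str,
    priority_order: List[str],
) -> pd.DataFrame:
    # Append unlisted values alphabetically so every value gets a rank.
    extras = sorted(set(data[priority_column].dropna()) - set(priority_order))
    rank = {value: position for position, value in enumerate(list(priority_order) + extras)}
    ranked = data.assign(_rank=data[priority_column].map(rank))
    # A stable sort keeps the original row order within each priority level.
    ranked = ranked.sort_values("_rank", kind="stable")
    deduped = ranked.drop_duplicates(subset=subset, keep="first")
    # Restoring the original order is an assumption, not documented behaviour.
    return deduped.drop(columns="_rank").sort_index()


rows = pd.DataFrame(
    {
        "provider_code": ["A", "A", "B", "B"],  # hypothetical column names
        "device_code": ["X", "X", "Y", "Y"],
        "rag_status": ["RED", "AMBER", "GREEN", "RED"],
    }
)
kept = drop_duplicates_on_priority_sketch(
    rows, ["provider_code", "device_code"], "rag_status", ["AMBER", "RED", "YELLOW"]
)
print(kept)
```

In the example, 'GREEN' is not in the priority list, so it is ranked after 'YELLOW' and loses to 'RED' for provider B.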
Source code in devices_rap/clean_data.py
check_duplicates(data, duplicate_severity='INFO', subset=None)
Check the dataset for duplicates and report the number found by raising an error, issuing a warning, or logging an info message. The duplicate_severity variable controls which of these is used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data` | `DataFrame` | The dataset to check for duplicates | required |
`duplicate_severity` | `Literal['ERROR'] \| Literal['WARNING'] \| Literal['INFO']` | The severity of the message to raise, by default "INFO" | `'INFO'` |
`subset` | `str \| List[str]` | The column(s) to use to identify duplicates, by default None | `None` |
Raises:
Type | Description |
---|---|
`DuplicateDataError` | If the severity is set to "ERROR" and duplicates are found |
`DuplicateDataWarning` | If the severity is set to "WARNING" and duplicates are found |
Side Effects
Logs a message with the number of duplicates found if severity is set to "INFO"
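The severity dispatch can be sketched as follows. The exception and warning classes are local stand-ins for the project's `DuplicateDataError` and `DuplicateDataWarning`, and returning the duplicate count is an addition made here for illustration; the documented function lists no return value.

```python
import logging
import warnings
from typing import List, Optional, Union

import pandas as pd

logger = logging.getLogger(__name__)


class DuplicateDataError(Exception):
    """Local stand-in for the project's exception."""


class DuplicateDataWarning(UserWarning):
    """Local stand-in for the project's warning category."""


def check_duplicates_sketch(
    data: pd.DataFrame,
    duplicate_severity: str = "INFO",
    subset: Optional[Union[str, List[str]]] = None,
) -> int:
    # duplicated() flags every repeat after the first occurrence.
    count = int(data.duplicated(subset=subset).sum())
    if count == 0:
        return count
    message = f"Found {count} duplicate row(s)"
    if duplicate_severity == "ERROR":
        raise DuplicateDataError(message)
    if duplicate_severity == "WARNING":
        warnings.warn(message, DuplicateDataWarning)
    else:
        logger.info(message)
    return count


sample = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})
print(check_duplicates_sketch(sample))
```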