3. Data preparation
3.1 Profiling
Data profiling is the process of examining a dataset and collecting statistics and information about it. This is important for selecting the right blocking and matching variables and for ensuring accurate linkage results.
Data Profiling
This includes establishing:
- Data dictionary (data types, data meaning, etc)
- Completeness (presence of missing or null values)
- Uniqueness / cardinality (how many and which unique values a variable can have)
- Validity of variable values (the values are included in the data dictionary, are valid, have a meaning)
- Distribution of values / outliers
- Basic statistics (min, max, mean, variance)
- Correlations and functional dependencies between variables
🔴 RED:
No data profiling is performed before linking, leading to potential inconsistencies in the linked data.
🟡 AMBER:
Some data profiling is performed before linking, but it could be more thorough and systematic.
🟢 GREEN:
Comprehensive data profiling is performed before linking, identifying and resolving inconsistencies in datasets to be linked.
Data Dictionary Creation
import pandas as pd

input_data = [[1, 20240224, 'sample_name']]
input_schema = ['id', 'dob', 'name']
input_df = pd.DataFrame(input_data, columns=input_schema)

def get_data_types(df):
    '''
    Function to get the data types of the columns in a DataFrame.

    Parameters:
        df (DataFrame): Input DataFrame
    Returns:
        DataFrame: DataFrame with columns 'Column' and 'Data Type'
    '''
    data_types = [[col_name, str(col_type)] for col_name, col_type in zip(df.columns, df.dtypes)]
    schema = ['Column', 'Data Type']
    result_df = pd.DataFrame(data_types, columns=schema)
    return result_df

output = get_data_types(input_df)
print(output)

  Column Data Type
0     id     int64
1    dob     int64
2   name    object
Completeness
import pandas as pd

input_data = [[1, 20240224, 'sample_name'], [None, 20240224, 'sample_name'],
              [1, 20240224, None], [1, 20240224, None]]
input_schema = ['id', 'dob', 'name']
input_df = pd.DataFrame(input_data, columns=input_schema)

def null_percentage(df):
    '''
    Function to calculate the percentage of null values in each column of a pandas DataFrame.

    Parameters:
        df (DataFrame): Input DataFrame
    Returns:
        DataFrame: DataFrame with columns 'Column' and 'Null Percentage'
    '''
    total_rows = len(df)
    null_counts = df.isnull().sum()
    null_percentages = (null_counts / total_rows) * 100
    result_df = pd.DataFrame({'Column': null_percentages.index, 'Null Percentage': null_percentages.values})
    return result_df

output = null_percentage(input_df)
print(output)

  Column  Null Percentage
0     id             25.0
1    dob              0.0
2   name             50.0
Uniqueness
import pandas as pd

input_data = [[1, 20240224, 'sample_name'], [None, 20240224, 'sample_name2'],
              [3, 20240224, None], [2, 20240224, None]]
input_schema = ['id', 'dob', 'name']
input_df = pd.DataFrame(input_data, columns=input_schema)

def uniqueness(df):
    '''
    Function to calculate the uniqueness of values in each column of a pandas DataFrame.

    Parameters:
        df (DataFrame): Input DataFrame
    Returns:
        DataFrame: DataFrame with columns 'Column', '# of Unique Values' and 'Uniqueness %'
    '''
    unique_counts = df.nunique()
    total_rows = len(df)
    uniqueness_percentages = (unique_counts / total_rows) * 100
    # use .values so the Series index does not leak into the result
    result_df = pd.DataFrame({'Column': unique_counts.index,
                              '# of Unique Values': unique_counts.values,
                              'Uniqueness %': uniqueness_percentages.values})
    return result_df

output = uniqueness(input_df)
print(output)

  Column  # of Unique Values  Uniqueness %
0     id                   3          75.0
1    dob                   1          25.0
2   name                   2          50.0
Cardinality
import pandas as pd

input_data = [[1, 20240224, 'sample_name'], [None, 20240224, 'sample_name2'],
              [3, 20240224, 'sample_name2'], [2, 20240224, None]]
input_schema = ['id', 'dob', 'name']
input_df = pd.DataFrame(input_data, columns=input_schema)

def cardinality(df, column_name):
    '''
    Function to find which values a variable can take and how many times each appears.

    Parameters:
        df (DataFrame): Input DataFrame
        column_name (str): Name of the column whose cardinality you want to know.
    Returns:
        DataFrame: DataFrame with the value and its count
    '''
    value_counts = df[column_name].value_counts().reset_index()
    value_counts.columns = [column_name, 'count']
    return value_counts

output = cardinality(input_df, 'name')
print(output)

           name  count
0  sample_name2      2
1   sample_name      1
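The profiling list above also mentions the distribution of values and outliers, which none of the examples cover. A minimal sketch along the same lines, flagging outliers with the common interquartile-range rule (the data and the 1.5 multiplier are illustrative assumptions, not values from this guidance):

```python
import pandas as pd

input_data = [[1, 23], [2, 18], [3, 55], [4, 44], [5, 250]]
input_df = pd.DataFrame(input_data, columns=['id', 'age'])

def flag_outliers(df, column_name, k=1.5):
    '''
    Flags values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers.
    k=1.5 is a common convention, not a prescribed threshold.
    '''
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # keep only the rows falling outside the acceptable range
    return df[(df[column_name] < lower) | (df[column_name] > upper)]

print(flag_outliers(input_df, 'age'))  # flags the row with age 250
```

Implausible values such as an age of 250 often indicate data-entry errors that should be addressed during enrichment.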
Basic Stats
import pandas as pd

input_data = [[1, 'john', 23, 51], [2, 'lucy', 18, 103],
              [3, 'claire', 55, 87], [4, 'sam', 44, 70]]
input_schema = ['id', 'name', 'age', 'weight']
input_df = pd.DataFrame(input_data, columns=input_schema)

def basic_stats(df):
    '''
    Function to calculate basic statistics for each numerical column of a pandas DataFrame.

    Parameters:
        df (DataFrame): Input DataFrame
    Returns:
        DataFrame: DataFrame with columns 'column_name', 'mean', 'median', 'std', 'min', 'max'
    '''
    numerical_columns = df.select_dtypes(include=['number']).columns.tolist()
    rows = []
    for column in numerical_columns:
        col_mean = df[column].mean()
        col_median = df[column].median()
        col_std = df[column].std()
        col_min = df[column].min()  # avoid shadowing the built-in min
        col_max = df[column].max()  # avoid shadowing the built-in max
        rows.append([column, col_mean, col_median, col_std, col_min, col_max])
    schema = ['column_name', 'mean', 'median', 'std', 'min', 'max']
    stats = pd.DataFrame(rows, columns=schema)
    return stats

output = basic_stats(input_df)
print(output)

  column_name   mean  median        std  min  max
0          id   2.50     2.5   1.290994    1    4
1         age  35.00    33.5  17.454703   18   55
2      weight  77.75    78.5  22.351361   51  103
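The profiling list also includes correlations and functional dependencies between variables, for which no example is given above. A minimal sketch (reusing the same illustrative data) based on pandas' built-in correlation matrix:

```python
import pandas as pd

input_data = [[1, 23, 51], [2, 18, 103], [3, 55, 87], [4, 44, 70]]
input_df = pd.DataFrame(input_data, columns=['id', 'age', 'weight'])

def correlations(df):
    '''
    Pearson correlation matrix for the numerical columns of a DataFrame.
    Highly correlated columns may carry duplicated information for linkage.
    '''
    numerical = df.select_dtypes(include=['number'])
    return numerical.corr()

print(correlations(input_df))
```

Pairs with a correlation close to 1 or -1 are candidates for the "multiple variables carrying similar information" assessment described in section 3.2.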
3.2 Assessment
Assessing the findings from data profiling means using that information to evaluate whether the input data is fit for linkage purposes, and how the data profile affects the modelling choices for linkage.
Data assessment
This includes (but is not limited to):
- Assessing whether variables have inconsistent values across datasets, and how to address this issue
- Deciding which variables are most suitable to be used as blocking variables and for distance calculations, considering completeness, quality and validity
- Deciding which variables to use if there are multiple variables carrying similar information (i.e. correlated)
🔴 RED:
Data profiling results have not been assessed. There has been no effort to evaluate the input data for its suitability for linkage purposes. There is no consideration given to the consistency of variables across datasets, the selection of suitable blocking variables, or the handling of correlated variables.
🟡 AMBER:
Some level of data assessment has been conducted, but it is incomplete or inconsistent. There may be efforts to evaluate certain aspects of the data, such as identifying inconsistent variable values or selecting blocking variables, but these assessments are not comprehensive.
🟢 GREEN:
A rigorous assessment was carried out. Findings are documented and inform subsequent decisions on the linkage methods.
Profiling Assessment
Table 1
   gender  count
0       1   2238
1       2   1500
2       9    387
3       0    100
Table 2
   gender  count
0       1   2555
1       2   3002
2       0    233
In this example, the two datasets encode gender inconsistently: Table 1 uses codes 0, 1, 2 and 9, while Table 2 uses only 0, 1 and 2. This mismatch needs to be resolved before linkage.
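Encoding mismatches like this can be surfaced programmatically rather than by eye. A minimal sketch using the counts from the two tables above (`codes_only_in_first` is a hypothetical helper, not part of any library):

```python
import pandas as pd

# gender code counts from Table 1 and Table 2 above
table1 = pd.DataFrame({'gender': [1, 2, 9, 0], 'count': [2238, 1500, 387, 100]})
table2 = pd.DataFrame({'gender': [1, 2, 0], 'count': [2555, 3002, 233]})

def codes_only_in_first(df1, df2, column_name):
    '''Returns the values of column_name present in df1 but absent from df2.'''
    return set(df1[column_name]) - set(df2[column_name])

print(codes_only_in_first(table1, table2, 'gender'))  # code 9 appears only in Table 1
```

Any codes found only in one dataset indicate a mapping decision to be made during enrichment (section 3.3).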
Blocking Rules Assessment
       column  Null percentage
0  given_name              10%
1      gender               1%
2    postcode              30%
Here, gender and given_name are relatively complete, while postcode is missing for 30% of records, making it a weaker candidate for a blocking variable.
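A decision like this can be made systematically from the completeness profile. A minimal sketch (`blocking_candidates` is a hypothetical helper, and the 20% cutoff is illustrative; an appropriate threshold depends on the data):

```python
import pandas as pd

# null percentages from the blocking rules assessment table above
null_profile = pd.DataFrame({'column': ['given_name', 'gender', 'postcode'],
                             'null_percentage': [10, 1, 30]})

def blocking_candidates(profile_df, max_null_pct=20):
    '''
    Keeps columns whose null percentage is at or below a chosen threshold,
    ordered from most to least complete. The 20% default is illustrative.
    '''
    eligible = profile_df[profile_df['null_percentage'] <= max_null_pct]
    return eligible.sort_values('null_percentage')['column'].tolist()

print(blocking_candidates(null_profile))  # ['gender', 'given_name']
```

Completeness is only one criterion; as noted above, quality and validity of the values should feed into the same decision.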
Correlated Values Assessment
   Unique_ID            street  postcode  outcode   city
0          1   Portland Street    LS31BL      LS3  Leeds
1          2  Wellington Place    LS14AP      LS1  Leeds
Here, outcode is fully determined by postcode (it is the first part of it), so only one of the two variables needs to be used in the linkage model.
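A functional dependency like postcode → outcode can be verified by checking that each postcode maps to exactly one outcode. A minimal sketch on the table above (`functionally_dependent` is a hypothetical helper):

```python
import pandas as pd

df = pd.DataFrame({'Unique_ID': [1, 2],
                   'street': ['Portland Street', 'Wellington Place'],
                   'postcode': ['LS31BL', 'LS14AP'],
                   'outcode': ['LS3', 'LS1'],
                   'city': ['Leeds', 'Leeds']})

def functionally_dependent(df, determinant, dependent):
    '''
    True if each value of `determinant` maps to exactly one value of
    `dependent`, i.e. `dependent` adds no information beyond `determinant`.
    '''
    return bool((df.groupby(determinant)[dependent].nunique() <= 1).all())

print(functionally_dependent(df, 'postcode', 'outcode'))  # True
```

When the dependency holds, retaining only the determinant variable avoids giving the same evidence double weight in matching.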
3.3 Enrichment
Data enrichment is the process of enhancing the quality and usefulness of data for data linkage. It involves transforming, standardising, filtering, and cleaning the data to optimise its value.
Data enrichment
- Transform: Create derived variables, recode categories, adjust formats, reshape data structures to ensure compatibility.
- Standardise: Convert formats (dates, text case), map variable categories, address recording inconsistencies by applying standardisation techniques.
- Exclude: Protect privacy and accuracy by removing sensitive data, national data opt-outs, and irrelevant fields.
- Clean: Identify and correct errors, missing values, inconsistencies, and outliers.
Document the results of data transformations and standardisation to maintain transparency and reproducibility.
🔴 RED:
There is no structured approach for data transformation.
If transformations have been applied, they are scattered in the code, and have not been identified or documented.
No efforts have been made to identify and exclude some sensitive, irrelevant or incorrect data.
🟡 AMBER:
While basic procedures for data transformation are in place, they aren't comprehensive.
Some standardised practices can be observed, but partially documented.
Efforts have been made to identify and exclude some sensitive, irrelevant or incorrect data, but the approach lacks comprehensiveness.
🟢 GREEN:
Data is systematically reviewed for the need of being transformed, standardised, filtered and cleansed.
The processes are documented and identified in the code.
Transform gender values
import pandas as pd

gender_untransformed_schema = ['patient_id', 'gender']
untransformed_gender_data = [[1001, 1], [1002, 0], [1003, 2], [1004, 1],
                             [1005, 1], [1006, 9], [1007, 2], [1008, 9]]
untransformed_gender_df = pd.DataFrame(untransformed_gender_data, columns=gender_untransformed_schema)

def transform_gender_formatting(df, column_to_standardise):
    '''
    Converts encodings of 9 ('unknown') for gender to 0.
    Works on a copy so the input DataFrame is left unchanged.
    '''
    df = df.copy()
    df['transformed_gender'] = df[column_to_standardise].replace(9, 0)
    return df

transformed_df = transform_gender_formatting(untransformed_gender_df, 'gender')
print("Untransformed gender")
print(untransformed_gender_df)
print('\n')
print("Transformed gender")
print(transformed_df)

Untransformed gender
   patient_id  gender
0        1001       1
1        1002       0
2        1003       2
3        1004       1
4        1005       1
5        1006       9
6        1007       2
7        1008       9


Transformed gender
   patient_id  gender  transformed_gender
0        1001       1                   1
1        1002       0                   0
2        1003       2                   2
3        1004       1                   1
4        1005       1                   1
5        1006       9                   0
6        1007       2                   2
7        1008       9                   0
In this example, a decision was made to standardise the gender encoding of '9' (unknown) to '0' (unspecified), a value already present in the table. This kind of standardisation can be needed when one table uses multiple encodings for 'other' but another table maps 'other' to a single code.
Standardise text case
import pandas as pd

names_unstandardised = ['patient_id', 'name']
unstandardised_names_data = [[1001, 'alan'], [1002, 'BOB'], [1003, 'carla'], [1004, 'David'],
                             [1005, 'ERIC'], [1006, 'frances'], [1007, 'gReg'], [1008, 'Hannah']]
unstandardised_names_df = pd.DataFrame(unstandardised_names_data, columns=names_unstandardised)

def standardise_name_formatting(df, column_to_standardise):
    '''
    Converts all names to upper case.
    Works on a copy so the input DataFrame is left unchanged.
    '''
    df = df.copy()
    df['standardised_name'] = df[column_to_standardise].str.upper()
    return df

standardised_df = standardise_name_formatting(unstandardised_names_df, 'name')
print("Unstandardised names")
print(unstandardised_names_df)
print('\n')
print("Standardised names")
print(standardised_df)

Unstandardised names
   patient_id     name
0        1001     alan
1        1002      BOB
2        1003    carla
3        1004    David
4        1005     ERIC
5        1006  frances
6        1007     gReg
7        1008   Hannah


Standardised names
   patient_id     name standardised_name
0        1001     alan              ALAN
1        1002      BOB               BOB
2        1003    carla             CARLA
3        1004    David             DAVID
4        1005     ERIC              ERIC
5        1006  frances           FRANCES
6        1007     gReg              GREG
7        1008   Hannah            HANNAH
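The enrichment list above also mentions standardising date formats, which the examples do not cover. A minimal sketch in the same style (the data and the `standardise_date_formatting` helper name are illustrative assumptions), rewriting day/month/year dates as ISO 8601:

```python
import pandas as pd

unstandardised_dates_df = pd.DataFrame(
    {'patient_id': [1001, 1002, 1003],
     'dob': ['24/02/2024', '05/11/2023', '31/01/2024']})

def standardise_date_formatting(df, column_to_standardise):
    '''
    Parses dd/mm/yyyy dates and rewrites them as ISO 8601 (YYYY-MM-DD).
    Unparseable values become NaT rather than raising, so they can be
    profiled and handled later. Works on a copy of the input DataFrame.
    '''
    df = df.copy()
    parsed = pd.to_datetime(df[column_to_standardise], format='%d/%m/%Y', errors='coerce')
    df['standardised_dob'] = parsed.dt.strftime('%Y-%m-%d')
    return df

print(standardise_date_formatting(unstandardised_dates_df, 'dob'))
```

Standardising dates to a single format before linkage avoids spurious disagreements between records that actually refer to the same date.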
Exclude records for linkage by patient consent
import pandas as pd

patient_consent = ['patient_id', 'consent']
unfiltered_patient_data = [[1001, True], [1002, True], [1003, True], [1004, True],
                           [1005, False], [1006, True], [1007, True], [1008, True]]
unfiltered_consent_df = pd.DataFrame(unfiltered_patient_data, columns=patient_consent)

def filter_to_consenting_patients(df, patient_consent_column):
    '''
    Removes from the linkage any patients who have not consented to their data being used.
    '''
    return df[df[patient_consent_column]]

consenting_patients_df = filter_to_consenting_patients(unfiltered_consent_df, 'consent')
print("Unfiltered patient consent table")
print(unfiltered_consent_df)
print('\n')
print("Table of patients filtered to those who consent to their data being analysed")
print(consenting_patients_df)

Unfiltered patient consent table
   patient_id  consent
0        1001     True
1        1002     True
2        1003     True
3        1004     True
4        1005    False
5        1006     True
6        1007     True
7        1008     True


Table of patients filtered to those who consent to their data being analysed
   patient_id  consent
0        1001     True
1        1002     True
2        1003     True
3        1004     True
5        1006     True
6        1007     True
7        1008     True
Clean null name values
import pandas as pd

name_uncleaned_schema = ['patient_id', 'name']
uncleaned_name_data = [[1001, 'alan'], [1002, 'BOB'], [1003, 'carla'], [1004, 'David'],
                       [1005, None], [1006, 'frances'], [1007, 'gReg'], [1008, None]]
uncleaned_name_df = pd.DataFrame(uncleaned_name_data, columns=name_uncleaned_schema)

def clean_null_name_values(df, column_to_standardise):
    '''
    Recodes null names to 'Unspecified'.
    Works on a copy so the input DataFrame is left unchanged.
    '''
    df = df.copy()
    df['cleaned_name'] = df[column_to_standardise].fillna('Unspecified')
    return df

cleaned_df = clean_null_name_values(uncleaned_name_df, 'name')
print("Uncleaned name table")
print(uncleaned_name_df)
print('\n')
print("Cleaned name table")
print(cleaned_df)

Uncleaned name table
   patient_id     name
0        1001     alan
1        1002      BOB
2        1003    carla
3        1004    David
4        1005     None
5        1006  frances
6        1007     gReg
7        1008     None


Cleaned name table
   patient_id     name cleaned_name
0        1001     alan         alan
1        1002      BOB          BOB
2        1003    carla        carla
3        1004    David        David
4        1005     None  Unspecified
5        1006  frances      frances
6        1007     gReg         gReg
7        1008     None  Unspecified
In this example, any null name values are recoded to 'Unspecified'.
If you have any ideas or feedback you'd like to give the team, feel free to contact us.
Created: September 20, 2024