Defining a downstream task
A synthetic dataset is often associated with specific modelling efforts or metrics that are not covered by the general suite of evaluation tools this package supports out of the box. Additionally, the bias and fairness analyses on model outputs provided via Aequitas require a set of predictions on which to operate. For these reasons, we provide a simple interface for defining a custom downstream task.
All downstream tasks should be located in a folder named `tasks` in the working directory of the project, with a subfolder for each dataset, e.g. the tasks associated with the support dataset should be located in the `tasks/support` directory.
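For example, a project defining two tasks for the support dataset might be laid out as follows (the task file names are purely illustrative):

```
<project working directory>
└── tasks/
    └── support/
        ├── logistic_regression.py
        └── length_of_stay_model.py
```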
The interface is then quite simple:
- There should be a function called `run` that takes a single argument: `dataset` (additional arguments could be provided with some further configuration if there is a need for this)
- The `run` function should fit a model and / or calculate some metric(s) on the dataset
- It should then return predicted probabilities for the outcome variable(s) in the dataset and a dictionary of metrics
- The file should contain a top-level variable containing an instantiation of the `Task` class
See the example below of a logistic regression model fit on the support dataset, with the `event` variable as the outcome and `rocauc` as the metric of interest:
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from nhssynth.modules.evaluation.tasks import Task


def run(dataset: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    # Split the dataset into features and target
    target = "event"
    data = dataset.dropna()
    X, y = data.drop(["dob", "x3", target], axis=1), data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        StandardScaler().fit_transform(X), y, test_size=0.33, random_state=42
    )
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    # Get the predicted probabilities and predictions
    probs = pd.DataFrame(lr.predict_proba(X_test)[:, 1], columns=[f"lr_{target}_prob"])
    rocauc = roc_auc_score(y_test, probs)
    return probs, {"rocauc_lr": rocauc}


task = Task("Logistic Regression on 'event'", run, supports_aequitas=True)
```
Note the following key points in the example above:
- The `Task` class has been imported from `nhssynth.modules.evaluation.tasks`
- The `run` function should accept one argument and return a tuple
- The second element of this tuple should be a dictionary labelling each metric of interest (this name will be used in the dashboard as an identifier, so ensure it is unique to the experiment)
- The `task` should be instantiated with a name, the `run` function, and a boolean indicating whether the task supports Aequitas analysis. If the task does not support Aequitas analysis, the first element of the tuple will not be used and `None` can be returned instead (see the sketch after this list)
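As a minimal sketch of a task that does not support Aequitas analysis, the file below computes a single dataset-level metric and returns `None` in place of the predicted probabilities; the `age` column name and the metric itself are assumptions for illustration only:

```python
import pandas as pd

from nhssynth.modules.evaluation.tasks import Task


def run(dataset: pd.DataFrame) -> tuple[None, dict]:
    # No model is fit and no predictions are produced, so there is nothing
    # for Aequitas to analyse; None is returned in place of the probabilities
    mean_age = dataset["age"].mean()  # "age" is an illustrative column name
    return None, {"mean_age": mean_age}


task = Task("Mean age summary", run, supports_aequitas=False)
```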
The rest of the file can contain any arbitrary code that runs within these constraints; this could be a simple model as above, or a more complex pipeline of transformations and models to match a pre-existing workflow.
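For instance, a task could wrap a scikit-learn preprocessing and modelling pipeline, as in the sketch below. The column names and preprocessing choices are assumptions for illustration; only the `run` / `Task` structure is required:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from nhssynth.modules.evaluation.tasks import Task


def run(dataset: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    target = "event"  # as in the example above
    numeric = ["age", "meanbp"]  # illustrative column names
    categorical = ["sex", "dzgroup"]  # illustrative column names
    data = dataset.dropna(subset=numeric + categorical + [target])
    X, y = data[numeric + categorical], data[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    # Scale numeric columns and one-hot encode categorical ones, then fit a model
    pipeline = Pipeline(
        [
            (
                "preprocess",
                ColumnTransformer(
                    [
                        ("num", StandardScaler(), numeric),
                        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
                    ]
                ),
            ),
            ("model", LogisticRegression(max_iter=1000)),
        ]
    )
    pipeline.fit(X_train, y_train)
    probs = pd.DataFrame(pipeline.predict_proba(X_test)[:, 1], columns=[f"pipeline_{target}_prob"])
    f1 = f1_score(y_test, pipeline.predict(X_test))
    return probs, {"f1_pipeline": f1}


task = Task("Pipeline: logistic regression on 'event'", run, supports_aequitas=True)
```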