Evaluation

Modules:

Name Description
evaluator
evaluators
experiment

Classes:

Name Description
EvaluationRecord

A record identifying evaluations for a specific task.

Evaluator

A class for LLM output evaluators.

ExperimentBatchConfig

Configuration for a batch of experiments to be executed by a pipeline.

ExperimentConfig

Configuration for an experiment to be executed by a pipeline.

GenerationRecord

A record identifying generations for a specific task.

PerturbationGroupedRecord

A record grouping evaluation records by generator name (as generator name specifies the perturbation tier).

ResultRecord

A record indicating the result of generation or evaluation.

ScoreCalculator

A protocol for computing evaluation scores.

ScorerFactory

A protocol for constructing a Scorer given a Model.

TaskConfig

Configuration for a task to be executed by a pipeline.

EvaluationRecord

Bases: GenerationRecord

A record identifying evaluations for a specific task.

Attributes:

Name Type Description
dataset_record DatasetRecord

The record of the dataset.

generator_name str

The name of the generator.

task_name str

The name of the task.

model_record ModelRecord

The record of the model.

experiment_name str | None

The name of the experiment, if applicable.

evaluator_name str

The name of the evaluator.

Methods:

Name Description
__eq__

Checks equality with another record.

__hash__

Generates a hash for the evaluation record.

__lt__

Checks if this record is less than another record.

get_evaluation_record

Generates an evaluation record from the generation record.

get_perturbation_grouped_record

Generates a perturbation grouped record from the evaluation record.

Source code in evalsense/evaluation/experiment.py
@total_ordering
class EvaluationRecord(GenerationRecord, frozen=True):
    """A record identifying evaluations for a specific task.

    Attributes:
        dataset_record (DatasetRecord): The record of the dataset.
        generator_name (str): The name of the generator.
        task_name (str): The name of the task.
        model_record (ModelRecord): The record of the model.
        experiment_name (str | None): The name of the experiment, if applicable.
        evaluator_name (str): The name of the evaluator.
    """

    evaluator_name: str

    @property
    def generation_record(self) -> GenerationRecord:
        """Generates a generation record from the evaluation record.

        Returns:
            GenerationRecord: The generation record.
        """
        return GenerationRecord(
            **self.model_dump(exclude={"evaluator_name"}),
        )

    def get_perturbation_grouped_record(
        self, metric_name: str
    ) -> "PerturbationGroupedRecord":
        """Generates a perturbation grouped record from the evaluation record.

        Args:
            metric_name (str): The name of the metric being evaluated.

        Returns:
            PerturbationGroupedRecord: The perturbation grouped record.
        """
        return PerturbationGroupedRecord(
            **self.model_dump(exclude={"generator_name"}),
            generator_name="",
            metric_name=metric_name,
        )

    @property
    def label(self) -> str:
        """Generates a label for the evaluation record.

        Returns:
            str: The label for the evaluation record.
        """
        return (
            f"{self.dataset_record.name} | {self.task_name} | {self.generator_name} | "
            f"{self.model_record.name} | {self.evaluator_name}"
        )

    def __eq__(self, other: object) -> bool:
        """Checks equality with another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if the records are equal, False otherwise.
        """
        if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
            return NotImplemented
        generation_equal = super().__eq__(other)
        if generation_equal is NotImplemented:
            return generation_equal
        return generation_equal and self.evaluator_name == other.evaluator_name

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
            return NotImplemented
        if super().__lt__(other):
            return True
        elif super().__eq__(other):
            return self.evaluator_name < other.evaluator_name
        else:
            return False

    def __hash__(self) -> int:
        """Generates a hash for the evaluation record.

        Returns:
            int: The hash of the evaluation record.
        """
        return hash(
            (
                self.dataset_record,
                self.generator_name,
                self.task_name,
                self.model_record,
                self.experiment_name,
                self.evaluator_name,
            )
        )

generation_record property

generation_record: GenerationRecord

Generates a generation record from the evaluation record.

Returns:

Name Type Description
GenerationRecord GenerationRecord

The generation record.

label property

label: str

Generates a label for the evaluation record.

Returns:

Name Type Description
str str

The label for the evaluation record.

__eq__

__eq__(other: object) -> bool

Checks equality with another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if the records are equal, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __eq__(self, other: object) -> bool:
    """Checks equality with another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if the records are equal, False otherwise.
    """
    if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
        return NotImplemented
    generation_equal = super().__eq__(other)
    if generation_equal is NotImplemented:
        return generation_equal
    return generation_equal and self.evaluator_name == other.evaluator_name

__hash__

__hash__() -> int

Generates a hash for the evaluation record.

Returns:

Name Type Description
int int

The hash of the evaluation record.

Source code in evalsense/evaluation/experiment.py
def __hash__(self) -> int:
    """Generates a hash for the evaluation record.

    Returns:
        int: The hash of the evaluation record.
    """
    return hash(
        (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name,
            self.evaluator_name,
        )
    )

__lt__

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if this record is less than the other, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
        return NotImplemented
    if super().__lt__(other):
        return True
    elif super().__eq__(other):
        return self.evaluator_name < other.evaluator_name
    else:
        return False

get_evaluation_record

get_evaluation_record(
    evaluator_name: str,
) -> EvaluationRecord

Generates an evaluation record from the generation record.

Parameters:

Name Type Description Default
evaluator_name str

The name of the evaluator.

required

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

The evaluation record.

Source code in evalsense/evaluation/experiment.py
def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
    """Generates an evaluation record from the generation record.

    Args:
        evaluator_name (str): The name of the evaluator.

    Returns:
        EvaluationRecord: The evaluation record.
    """
    return EvaluationRecord(
        **self.model_dump(),
        evaluator_name=evaluator_name,
    )

get_perturbation_grouped_record

get_perturbation_grouped_record(
    metric_name: str,
) -> PerturbationGroupedRecord

Generates a perturbation grouped record from the evaluation record.

Parameters:

Name Type Description Default
metric_name str

The name of the metric being evaluated.

required

Returns:

Name Type Description
PerturbationGroupedRecord PerturbationGroupedRecord

The perturbation grouped record.

Source code in evalsense/evaluation/experiment.py
def get_perturbation_grouped_record(
    self, metric_name: str
) -> "PerturbationGroupedRecord":
    """Generates a perturbation grouped record from the evaluation record.

    Args:
        metric_name (str): The name of the metric being evaluated.

    Returns:
        PerturbationGroupedRecord: The perturbation grouped record.
    """
    return PerturbationGroupedRecord(
        **self.model_dump(exclude={"generator_name"}),
        generator_name="",
        metric_name=metric_name,
    )
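
Example: a minimal usage sketch; generation is assumed to be an existing GenerationRecord, and the evaluator name "rouge" is purely illustrative. It shows how get_evaluation_record and the generation_record property round-trip between the two record types.

# A hedged sketch: generation is assumed to be an existing GenerationRecord.
evaluation = generation.get_evaluation_record("rouge")
print(evaluation.label)              # dataset | task | generator | model | rouge
print(evaluation.generation_record)  # the same record without the evaluator name

# Frozen records hash and compare by value, so they can key result caches.
results: dict[EvaluationRecord, ResultRecord] = {evaluation: ResultRecord()}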

Evaluator dataclass

A class for LLM output evaluators.

Attributes:

Name Type Description
model_name str

Retrieves the model name associated with the evaluator config.

Source code in evalsense/evaluation/evaluator.py
@dataclass
class Evaluator:
    """A class for LLM output evaluators."""

    name: str
    scorer: Scorer | ScorerFactory
    model_config: ModelConfig | None = None
    cleanup_fun: Callable[[], None] | None = None

    @property
    def model_name(self) -> str:
        """Retrieves the model name associated with the evaluator config.

        Returns an empty string if the evaluator doesn't use a model config.

        Returns:
            str: The name of the model in the config or empty string.
        """
        if self.model_config is None:
            return ""
        return self.model_config.name

model_name property

model_name: str

Retrieves the model name associated with the evaluator config.

Returns an empty string if the evaluator doesn't use a model config.

Returns:

Name Type Description
str str

The name of the model in the config or empty string.
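
Example: a minimal sketch of constructing an evaluator around inspect_ai's built-in match() scorer; the import path for Evaluator and the evaluator name are assumptions.

from inspect_ai.scorer import match

from evalsense.evaluation import Evaluator

# No judge model is needed for string matching, so model_config stays None.
match_evaluator = Evaluator(name="answer_match", scorer=match())
assert match_evaluator.model_name == ""  # no model config -> empty string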

ExperimentBatchConfig dataclass

Configuration for a batch of experiments to be executed by a pipeline.

Methods:

Name Description
validate

Validates the experiment configuration.

Attributes:

Name Type Description
all_experiments list[ExperimentConfig]

Generates a list of all experiments in the batch.

Source code in evalsense/evaluation/experiment.py
@dataclass
class ExperimentBatchConfig:
    """Configuration for a batch of experiments to be executed by a pipeline."""

    tasks: list[TaskConfig]
    model_configs: list[ModelConfig]
    evaluators: list[Evaluator] = field(default_factory=list)
    name: str | None = None

    def validate(self) -> None:
        """Validates the experiment configuration.

        Raises:
            ValueError: If the configuration is invalid.
        """
        if not self.tasks:
            raise ValueError("Experiment must have at least one task.")
        if not self.model_configs:
            raise ValueError("Experiment must have at least one LLM manager.")

    @property
    def all_experiments(self) -> list[ExperimentConfig]:
        """Generates a list of all experiments in the batch.

        Returns:
            list[ExperimentConfig]: A list of all experiments in the batch.
        """
        experiments = []
        for task in self.tasks:
            for llm_manager in self.model_configs:
                if self.evaluators:
                    for evaluator in self.evaluators:
                        experiments.append(
                            ExperimentConfig(
                                dataset_manager=task.dataset_manager,
                                generation_steps=task.generation_steps,
                                field_spec=task.field_spec,
                                task_preprocessor=task.task_preprocessor,
                                model_config=llm_manager,
                                evaluator=evaluator,
                                name=self.name,
                            )
                        )
                else:
                    experiments.append(
                        ExperimentConfig(
                            dataset_manager=task.dataset_manager,
                            generation_steps=task.generation_steps,
                            field_spec=task.field_spec,
                            task_preprocessor=task.task_preprocessor,
                            model_config=llm_manager,
                            name=self.name,
                        )
                    )
        return experiments

all_experiments property

all_experiments: list[ExperimentConfig]

Generates a list of all experiments in the batch.

Returns:

Type Description
list[ExperimentConfig]

list[ExperimentConfig]: A list of all experiments in the batch.

validate

validate() -> None

Validates the experiment configuration.

Raises:

Type Description
ValueError

If the configuration is invalid.

Source code in evalsense/evaluation/experiment.py
def validate(self) -> None:
    """Validates the experiment configuration.

    Raises:
        ValueError: If the configuration is invalid.
    """
    if not self.tasks:
        raise ValueError("Experiment must have at least one task.")
    if not self.model_configs:
        raise ValueError("Experiment must have at least one LLM manager.")

ExperimentConfig dataclass

Configuration for an experiment to be executed by a pipeline.

Attributes:

Name Type Description
evaluation_record EvaluationRecord

A record identifying evaluations for a specific task.

generation_record GenerationRecord

A record identifying generations for a specific task.

Source code in evalsense/evaluation/experiment.py
@dataclass
class ExperimentConfig:
    """Configuration for an experiment to be executed by a pipeline."""

    dataset_manager: DatasetManager
    generation_steps: GenerationSteps
    model_config: ModelConfig
    field_spec: FieldSpec | RecordToSample | None = None
    task_preprocessor: TaskPreprocessor = field(
        default_factory=lambda: DefaultTaskPreprocessor()
    )
    evaluator: Evaluator | None = None
    name: str | None = None

    @property
    def generation_record(self) -> GenerationRecord:
        """A identifying generations for a specific task.

        Returns:
            GenerationsRecord: A record of the generations for the experiment.
        """
        return GenerationRecord(
            dataset_record=self.dataset_manager.record,
            generator_name=self.generation_steps.name,
            task_name=self.task_preprocessor.name,
            model_record=self.model_config.record,
            experiment_name=self.name,
        )

    @property
    def evaluation_record(self) -> EvaluationRecord:
        """A identifying evaluations for a specific task.

        Returns:
            EvaluationRecord: A record of the evaluations for the experiment.
        """
        if self.evaluator is None:
            raise ValueError(
                "Cannot get evaluation record for an experiment without an evaluator"
            )

        return self.generation_record.get_evaluation_record(
            self.evaluator.name,
        )

evaluation_record property

evaluation_record: EvaluationRecord

A record identifying evaluations for a specific task.

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

A record of the evaluations for the experiment.

generation_record property

generation_record: GenerationRecord

A record identifying generations for a specific task.

Returns:

Name Type Description
GenerationRecord GenerationRecord

A record of the generations for the experiment.
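
Example: a minimal sketch of a single experiment and its records; dataset_manager, generation_steps, model_config and match_evaluator are assumed to be configured elsewhere in evalsense.

config = ExperimentConfig(
    dataset_manager=dataset_manager,
    generation_steps=generation_steps,
    model_config=model_config,
    evaluator=match_evaluator,
    name="baseline",
)
generation_record = config.generation_record  # identifies the generation run
evaluation_record = config.evaluation_record  # adds the evaluator name; raises ValueError without an evaluator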

GenerationRecord

Bases: BaseModel

A record identifying generations for a specific task.

Attributes:

Name Type Description
dataset_record DatasetRecord

The record of the dataset.

generator_name str

The name of the generator.

task_name str

The name of the task.

model_record ModelRecord

The record of the model.

experiment_name str | None

The name of the experiment, if applicable.

Methods:

Name Description
__eq__

Checks equality with another record.

__hash__

Generates a hash for the generation record.

__lt__

Checks if this record is less than another record.

get_evaluation_record

Generates an evaluation record from the generation record.

Source code in evalsense/evaluation/experiment.py
@total_ordering
class GenerationRecord(BaseModel, frozen=True):
    """A record identifying generations for a specific task.

    Attributes:
        dataset_record (DatasetRecord): The record of the dataset.
        generator_name (str): The name of the generator.
        task_name (str): The name of the task.
        model_record (ModelRecord): The record of the model.
        experiment_name (str | None): The name of the experiment, if applicable.
    """

    dataset_record: DatasetRecord
    generator_name: str
    task_name: str
    model_record: ModelRecord
    experiment_name: str | None = None

    def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
        """Generates an evaluation record from the generation record.

        Args:
            evaluator_name (str): The name of the evaluator.

        Returns:
            EvaluationRecord: The evaluation record.
        """
        return EvaluationRecord(
            **self.model_dump(),
            evaluator_name=evaluator_name,
        )

    @property
    def label(self) -> str:
        """Generates a label for the generation record.

        Returns:
            str: The label for the generation record.
        """
        return (
            f"{self.dataset_record.name} | {self.task_name} | "
            f"{self.generator_name} | {self.model_record.name}"
        )

    def __eq__(self, other: object) -> bool:
        """Checks equality with another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if the records are equal, False otherwise.
        """
        if not isinstance(other, GenerationRecord) or type(self) is not type(other):
            return NotImplemented
        return (
            self.dataset_record == other.dataset_record
            and self.generator_name == other.generator_name
            and self.task_name == other.task_name
            and self.model_record == other.model_record
            and self.experiment_name == other.experiment_name
        )

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, GenerationRecord) or type(self) is not type(other):
            return NotImplemented
        return (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name or "",
        ) < (
            other.dataset_record,
            other.generator_name,
            other.task_name,
            other.model_record,
            other.experiment_name or "",
        )

    def __hash__(self) -> int:
        """Generates a hash for the generation record.

        Returns:
            int: The hash of the generation record.
        """
        return hash(
            (
                self.dataset_record,
                self.generator_name,
                self.task_name,
                self.model_record,
                self.experiment_name,
            )
        )

label property

label: str

Generates a label for the generation record.

Returns:

Name Type Description
str str

The label for the generation record.

__eq__

__eq__(other: object) -> bool

Checks equality with another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if the records are equal, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __eq__(self, other: object) -> bool:
    """Checks equality with another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if the records are equal, False otherwise.
    """
    if not isinstance(other, GenerationRecord) or type(self) is not type(other):
        return NotImplemented
    return (
        self.dataset_record == other.dataset_record
        and self.generator_name == other.generator_name
        and self.task_name == other.task_name
        and self.model_record == other.model_record
        and self.experiment_name == other.experiment_name
    )

__hash__

__hash__() -> int

Generates a hash for the generation record.

Returns:

Name Type Description
int int

The hash of the generation record.

Source code in evalsense/evaluation/experiment.py
def __hash__(self) -> int:
    """Generates a hash for the generation record.

    Returns:
        int: The hash of the generation record.
    """
    return hash(
        (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name,
        )
    )

__lt__

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if this record is less than the other, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, GenerationRecord) or type(self) is not type(other):
        return NotImplemented
    return (
        self.dataset_record,
        self.generator_name,
        self.task_name,
        self.model_record,
        self.experiment_name or "",
    ) < (
        other.dataset_record,
        other.generator_name,
        other.task_name,
        other.model_record,
        other.experiment_name or "",
    )

get_evaluation_record

get_evaluation_record(
    evaluator_name: str,
) -> EvaluationRecord

Generates an evaluation record from the generation record.

Parameters:

Name Type Description Default
evaluator_name str

The name of the evaluator.

required

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

The evaluation record.

Source code in evalsense/evaluation/experiment.py
def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
    """Generates an evaluation record from the generation record.

    Args:
        evaluator_name (str): The name of the evaluator.

    Returns:
        EvaluationRecord: The evaluation record.
    """
    return EvaluationRecord(
        **self.model_dump(),
        evaluator_name=evaluator_name,
    )
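
Example: a minimal sketch showing how generation records are built and ordered; dataset_manager and model_config are assumed to be an evalsense DatasetManager and ModelConfig configured elsewhere, and the generator and task names are illustrative.

records = [
    GenerationRecord(
        dataset_record=dataset_manager.record,
        generator_name=generator_name,
        task_name="summarisation",
        model_record=model_config.record,
        experiment_name="baseline",
    )
    for generator_name in ("zero_shot", "few_shot")
]

# Records are frozen, hashable and totally ordered, so they sort deterministically
# and can serve as dictionary keys.
for record in sorted(records):
    print(record.label)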

PerturbationGroupedRecord

Bases: EvaluationRecord

A record grouping evaluation records by generator name (as generator name specifies the perturbation tier).

Attributes:

Name Type Description
dataset_record DatasetRecord

The record of the dataset.

generator_name str

The name of the generator.

task_name str

The name of the task.

model_record ModelRecord

The record of the model.

experiment_name str | None

The name of the experiment, if applicable.

evaluator_name str

The name of the evaluator.

metric_name str

The name of the metric being evaluated.

Methods:

Name Description
__eq__

Checks equality with another record.

__hash__

Generates a hash for the perturbation grouped record.

__lt__

Checks if this record is less than another record.

get_evaluation_record

Generates an evaluation record from the generation record.

get_perturbation_grouped_record

Generates a perturbation grouped record from the evaluation record.

Source code in evalsense/evaluation/experiment.py
@total_ordering
class PerturbationGroupedRecord(EvaluationRecord, frozen=True):
    """A record grouping evaluation records by generator name
    (as generator name specifies the perturbation tier).

    Attributes:
        dataset_record (DatasetRecord): The record of the dataset.
        generator_name (str): The name of the generator.
        task_name (str): The name of the task.
        model_record (ModelRecord): The record of the model.
        experiment_name (str | None): The name of the experiment, if applicable.
        evaluator_name (str): The name of the evaluator.
        metric_name (str): The name of the metric being evaluated.
    """

    metric_name: str

    def __eq__(self, other: object) -> bool:
        """Checks equality with another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if the records are equal, False otherwise.
        """
        if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
            other
        ):
            return NotImplemented
        generation_equal = super().__eq__(other)
        if generation_equal is NotImplemented:
            return generation_equal
        return generation_equal and self.metric_name == other.metric_name

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
            other
        ):
            return NotImplemented
        if super().__lt__(other):
            return True
        elif super().__eq__(other):
            return self.metric_name < other.metric_name
        else:
            return False

    def __hash__(self) -> int:
        """Generates a hash for the perturbation grouped record.

        Returns:
            int: The hash of the perturbation grouped record.
        """
        return hash(
            (
                self.dataset_record,
                self.generator_name,
                self.task_name,
                self.model_record,
                self.experiment_name,
                self.evaluator_name,
                self.metric_name,
            )
        )

generation_record property

generation_record: GenerationRecord

Generates a generation record from the evaluation record.

Returns:

Name Type Description
GenerationRecord GenerationRecord

The generation record.

label property

label: str

Generates a label for the evaluation record.

Returns:

Name Type Description
str str

The label for the evaluation record.

__eq__

__eq__(other: object) -> bool

Checks equality with another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if the records are equal, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __eq__(self, other: object) -> bool:
    """Checks equality with another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if the records are equal, False otherwise.
    """
    if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
        other
    ):
        return NotImplemented
    generation_equal = super().__eq__(other)
    if generation_equal is NotImplemented:
        return generation_equal
    return generation_equal and self.metric_name == other.metric_name

__hash__

__hash__() -> int

Generates a hash for the perturbation grouped record.

Returns:

Name Type Description
int int

The hash of the perturbation grouped record.

Source code in evalsense/evaluation/experiment.py
def __hash__(self) -> int:
    """Generates a hash for the perturbation grouped record.

    Returns:
        int: The hash of the perturbation grouped record.
    """
    return hash(
        (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name,
            self.evaluator_name,
            self.metric_name,
        )
    )

__lt__

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if this record is less than the other, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
        other
    ):
        return NotImplemented
    if super().__lt__(other):
        return True
    elif super().__eq__(other):
        return self.metric_name < other.metric_name
    else:
        return False

get_evaluation_record

get_evaluation_record(
    evaluator_name: str,
) -> EvaluationRecord

Generates an evaluation record from the generation record.

Parameters:

Name Type Description Default
evaluator_name str

The name of the evaluator.

required

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

The evaluation record.

Source code in evalsense/evaluation/experiment.py
def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
    """Generates an evaluation record from the generation record.

    Args:
        evaluator_name (str): The name of the evaluator.

    Returns:
        EvaluationRecord: The evaluation record.
    """
    return EvaluationRecord(
        **self.model_dump(),
        evaluator_name=evaluator_name,
    )

get_perturbation_grouped_record

get_perturbation_grouped_record(
    metric_name: str,
) -> PerturbationGroupedRecord

Generates a perturbation grouped record from the evaluation record.

Parameters:

Name Type Description Default
metric_name str

The name of the metric being evaluated.

required

Returns:

Name Type Description
PerturbationGroupedRecord PerturbationGroupedRecord

The perturbation grouped record.

Source code in evalsense/evaluation/experiment.py
def get_perturbation_grouped_record(
    self, metric_name: str
) -> "PerturbationGroupedRecord":
    """Generates a perturbation grouped record from the evaluation record.

    Args:
        metric_name (str): The name of the metric being evaluated.

    Returns:
        PerturbationGroupedRecord: The perturbation grouped record.
    """
    return PerturbationGroupedRecord(
        **self.model_dump(exclude={"generator_name"}),
        generator_name="",
        metric_name=metric_name,
    )
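
Example: a minimal sketch of grouping; eval_a and eval_b are assumed to be EvaluationRecords that differ only in generator_name, i.e. only in the applied perturbation tier, and the metric name and scores are illustrative.

grouped_a = eval_a.get_perturbation_grouped_record("rouge_l")
grouped_b = eval_b.get_perturbation_grouped_record("rouge_l")

# generator_name is reset to "" in the grouped record, so both evaluations
# collapse to the same grouping key.
assert grouped_a == grouped_b
scores_by_group: dict[PerturbationGroupedRecord, list[float]] = {grouped_a: [0.41, 0.37]}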

ResultRecord

Bases: BaseModel

A record indicating the result of generation or evaluation.

Attributes:

Name Type Description
status RecordStatus

The status of the record.

error_message str | None

The error message, if any.

log_location str | None

The location of the associated Inspect log file.

Source code in evalsense/evaluation/experiment.py
class ResultRecord(BaseModel, frozen=True):
    """A record indicating the result of generation or evaluation.

    Attributes:
        status (RecordStatus): The status of the record.
        error_message (str | None): The error message, if any.
        log_location (str | None): The location of the associated Inspect log file.
    """

    status: RecordStatus = "started"
    error_message: str | None = None
    log_location: str | None = None
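
Example: a minimal sketch using the default "started" status; the log path is hypothetical.

record = ResultRecord(log_location="logs/summarisation_zero_shot.eval")
if record.error_message is not None:
    print(f"Run failed: {record.error_message}")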

ScoreCalculator

Bases: Protocol

A protocol for computing evaluation scores.

Methods:

Name Description
calculate

Computes evaluation scores for the given evaluation method.

calculate_async

Asynchronously computes evaluation scores for the given evaluation method.

Source code in evalsense/evaluation/evaluator.py
@runtime_checkable
class ScoreCalculator(Protocol):
    """A protocol for computing evaluation scores."""

    @abstractmethod
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Computes evaluation scores for the given evaluation method

        Args:
            prediction (str): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.
            **kwargs (dict): Additional keyword arguments specific to the given
                evaluation method.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        pass

    @abstractmethod
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Asynchronously computes evaluation scores for the given evaluation method

        Args:
            prediction (str): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.
            **kwargs (dict): Additional keyword arguments specific to the given
                evaluation method.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        pass

calculate abstractmethod

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Computes evaluation scores for the given evaluation method.

Parameters:

Name Type Description Default
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None
**kwargs dict

Additional keyword arguments specific to the given evaluation method.

{}

Returns:

Name Type Description
Score Score

The Inspect AI Score object with the calculated result.

Source code in evalsense/evaluation/evaluator.py
@abstractmethod
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Computes evaluation scores for the given evaluation method

    Args:
        prediction (str): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.
        **kwargs (dict): Additional keyword arguments specific to the given
            evaluation method.

    Returns:
        Score: The Inspect AI Score object with the calculated result.
    """
    pass

calculate_async abstractmethod async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Asynchronously computes evaluation scores for the given evaluation method.

Parameters:

Name Type Description Default
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None
**kwargs dict

Additional keyword arguments specific to the given evaluation method.

{}

Returns:

Name Type Description
Score Score

The Inspect AI Score object with the calculated result.

Source code in evalsense/evaluation/evaluator.py
@abstractmethod
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Asynchronously computes evaluation scores for the given evaluation method

    Args:
        prediction (str): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.
        **kwargs (dict): Additional keyword arguments specific to the given
            evaluation method.

    Returns:
        Score: The Inspect AI Score object with the calculated result.
    """
    pass
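
Example: a hedged sketch of implementing the protocol with a simple exact-match calculator; it assumes inspect_ai's Score accepts value and answer arguments, and the async variant simply delegates to the synchronous one.

from typing import Any

from inspect_ai.scorer import Score


class ExactMatchCalculator:
    """Scores a prediction as 1.0 when it exactly matches the reference."""

    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        matched = reference is not None and prediction.strip() == reference.strip()
        return Score(value=1.0 if matched else 0.0, answer=prediction)

    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        # Exact matching is cheap, so the async variant just delegates.
        return self.calculate(
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
            **kwargs,
        )


# runtime_checkable protocols check method presence (not signatures).
assert isinstance(ExactMatchCalculator(), ScoreCalculator)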

ScorerFactory

Bases: Protocol

A protocol for constructing a Scorer given a Model.

Methods:

Name Description
create_scorer

Creates a Scorer from a Model.

Source code in evalsense/evaluation/evaluator.py
@runtime_checkable
class ScorerFactory(Protocol):
    """A protocol for constructing a Scorer given a Model."""

    @abstractmethod
    def create_scorer(self, model: Model) -> Scorer:
        """Creates a Scorer from a Model.

        Args:
            model (Model): The model to create a scorer for.

        Returns:
            Scorer: The created scorer.
        """
        pass

create_scorer abstractmethod

create_scorer(model: Model) -> Scorer

Creates a Scorer from a Model.

Parameters:

Name Type Description Default
model Model

The model to create a scorer for.

required

Returns:

Name Type Description
Scorer Scorer

The created scorer.

Source code in evalsense/evaluation/evaluator.py
@abstractmethod
def create_scorer(self, model: Model) -> Scorer:
    """Creates a Scorer from a Model.

    Args:
        model (Model): The model to create a scorer for.

    Returns:
        Scorer: The created scorer.
    """
    pass
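
Example: a hedged sketch of a factory that builds a model-graded scorer; the scorer, accuracy, Target and TaskState imports and the model.generate call follow the usual inspect_ai API, and the grading prompt is purely illustrative.

from inspect_ai.model import Model
from inspect_ai.scorer import Score, Scorer, Target, accuracy, scorer
from inspect_ai.solver import TaskState


class JudgeScorerFactory:
    """Builds a scorer that asks the supplied model to grade each output."""

    def create_scorer(self, model: Model) -> Scorer:
        @scorer(metrics=[accuracy()])
        def judge() -> Scorer:
            async def score(state: TaskState, target: Target) -> Score:
                # Ask the judge model whether the generated output matches the target.
                verdict = await model.generate(
                    "Reply with GRADE: C if the answer matches the target, "
                    "or GRADE: I otherwise.\n"
                    f"Answer: {state.output.completion}\nTarget: {target.text}"
                )
                return Score(value=1.0 if "GRADE: C" in verdict.completion else 0.0)

            return score

        return judge()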

TaskConfig dataclass

Configuration for a task to be executed by a pipeline.

Source code in evalsense/evaluation/experiment.py
@dataclass
class TaskConfig:
    """Configuration for a task to be executed by a pipeline."""

    dataset_manager: DatasetManager
    generation_steps: GenerationSteps
    field_spec: FieldSpec | RecordToSample | None = None
    task_preprocessor: TaskPreprocessor = field(
        default_factory=lambda: DefaultTaskPreprocessor()
    )
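
Example: a minimal sketch; the DatasetManager and GenerationSteps instances (and the model config used for the batch) are assumed to be configured elsewhere in evalsense.

task = TaskConfig(
    dataset_manager=dataset_manager,
    generation_steps=generation_steps,
)

# A task is typically combined with model configs and evaluators in an
# ExperimentBatchConfig, as shown above.
batch = ExperimentBatchConfig(tasks=[task], model_configs=[model_config])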