Evaluation

Modules:

Name Description
evaluator
evaluators
experiment

Classes:

Name Description
EvaluationRecord

A record identifying evaluations for a specific task.

Evaluator

A class for LLM output evaluators.

ExperimentBatchConfig

Configuration for a batch of experiments to be executed by a pipeline.

ExperimentConfig

Configuration for an experiment to be executed by a pipeline.

GenerationRecord

A record identifying generations for a specific task.

PerturbationGroupedRecord

A record grouping evaluation records by generator name (as generator name specifies the perturbation tier).

ResultRecord

A record indicating the result of generation or evaluation.

ScoreCalculator

A protocol for computing evaluation scores.

ScorerFactory

A protocol for constructing a Scorer given a Model.

TaskConfig

Configuration for a task to be executed by a pipeline.

EvaluationRecord

Bases: GenerationRecord

A record identifying evaluations for a specific task.

Attributes:

Name Type Description
dataset_record DatasetRecord

The record of the dataset.

generator_name str

The name of the generator.

task_name str

The name of the task.

model_record ModelRecord

The record of the model.

experiment_name str | None

The name of the experiment, if applicable.

evaluator_name str

The name of the evaluator.

Methods:

Name Description
__eq__

Checks equality with another record.

__hash__

Generates a hash for the evaluation record.

__lt__

Checks if this record is less than another record.

get_evaluation_record

Generates an evaluation record from the generation record.

get_perturbation_grouped_record

Generates a perturbation grouped record from the evaluation record.

Source code in evalsense/evaluation/experiment.py
@total_ordering
class EvaluationRecord(GenerationRecord, frozen=True):
    """A record identifying evaluations for a specific task.

    Attributes:
        dataset_record (DatasetRecord): The record of the dataset.
        generator_name (str): The name of the generator.
        task_name (str): The name of the task.
        model_record (ModelRecord): The record of the model.
        experiment_name (str | None): The name of the experiment, if applicable.
        evaluator_name (str): The name of the evaluator.
    """

    evaluator_name: str

    @property
    def generation_record(self) -> GenerationRecord:
        """Generates a generation record from the evaluation record.

        Returns:
            GenerationRecord: The generation record.
        """
        return GenerationRecord(
            **self.model_dump(exclude={"evaluator_name"}),
        )

    def get_perturbation_grouped_record(
        self, metric_name: str
    ) -> "PerturbationGroupedRecord":
        """Generates a perturbation grouped record from the evaluation record.

        Args:
            metric_name (str): The name of the metric being evaluated.

        Returns:
            PerturbationGroupedRecord: The perturbation grouped record.
        """
        return PerturbationGroupedRecord(
            **self.model_dump(exclude={"generator_name"}),
            generator_name="",
            metric_name=metric_name,
        )

    @property
    def label(self) -> str:
        """Generates a label for the evaluation record.

        Returns:
            str: The label for the evaluation record.
        """
        return (
            f"{self.dataset_record.name} | {self.task_name} | {self.generator_name} | "
            f"{self.model_record.name} | {self.evaluator_name}"
        )

    def __eq__(self, other: object) -> bool:
        """Checks equality with another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if the records are equal, False otherwise.
        """
        if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
            return NotImplemented
        generation_equal = super().__eq__(other)
        if generation_equal is NotImplemented:
            return generation_equal
        return generation_equal and self.evaluator_name == other.evaluator_name

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
            return NotImplemented
        if super().__lt__(other):
            return True
        elif super().__eq__(other):
            return self.evaluator_name < other.evaluator_name
        else:
            return False

    def __hash__(self) -> int:
        """Generates a hash for the evaluation record.

        Returns:
            int: The hash of the evaluation record.
        """
        return hash(
            (
                self.dataset_record,
                self.generator_name,
                self.task_name,
                self.model_record,
                self.experiment_name,
                self.evaluator_name,
            )
        )

generation_record property

generation_record: GenerationRecord

Generates a generation record from the evaluation record.

Returns:

Name Type Description
GenerationRecord GenerationRecord

The generation record.

label property

label: str

Generates a label for the evaluation record.

Returns:

Name Type Description
str str

The label for the evaluation record.

__eq__

__eq__(other: object) -> bool

Checks equality with another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if the records are equal, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __eq__(self, other: object) -> bool:
    """Checks equality with another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if the records are equal, False otherwise.
    """
    if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
        return NotImplemented
    generation_equal = super().__eq__(other)
    if generation_equal is NotImplemented:
        return generation_equal
    return generation_equal and self.evaluator_name == other.evaluator_name

__hash__

__hash__() -> int

Generates a hash for the evaluation record.

Returns:

Name Type Description
int int

The hash of the evaluation record.

Source code in evalsense/evaluation/experiment.py
def __hash__(self) -> int:
    """Generates a hash for the evaluation record.

    Returns:
        int: The hash of the evaluation record.
    """
    return hash(
        (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name,
            self.evaluator_name,
        )
    )

__lt__

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if this record is less than the other, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, EvaluationRecord) or type(self) is not type(other):
        return NotImplemented
    if super().__lt__(other):
        return True
    elif super().__eq__(other):
        return self.evaluator_name < other.evaluator_name
    else:
        return False

get_evaluation_record

get_evaluation_record(
    evaluator_name: str,
) -> EvaluationRecord

Generates an evaluation record from the generation record.

Parameters:

Name Type Description Default
evaluator_name str

The name of the evaluator.

required

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

The evaluation record.

Source code in evalsense/evaluation/experiment.py
def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
    """Generates an evaluation record from the generation record.

    Args:
        evaluator_name (str): The name of the evaluator.

    Returns:
        EvaluationRecord: The evaluation record.
    """
    return EvaluationRecord(
        **self.model_dump(),
        evaluator_name=evaluator_name,
    )

get_perturbation_grouped_record

get_perturbation_grouped_record(
    metric_name: str,
) -> PerturbationGroupedRecord

Generates a perturbation grouped record from the evaluation record.

Parameters:

Name Type Description Default
metric_name str

The name of the metric being evaluated.

required

Returns:

Name Type Description
PerturbationGroupedRecord PerturbationGroupedRecord

The perturbation grouped record.

Source code in evalsense/evaluation/experiment.py
def get_perturbation_grouped_record(
    self, metric_name: str
) -> "PerturbationGroupedRecord":
    """Generates a perturbation grouped record from the evaluation record.

    Args:
        metric_name (str): The name of the metric being evaluated.

    Returns:
        PerturbationGroupedRecord: The perturbation grouped record.
    """
    return PerturbationGroupedRecord(
        **self.model_dump(exclude={"generator_name"}),
        generator_name="",
        metric_name=metric_name,
    )
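
Example: a minimal usage sketch; generation is assumed to be an existing GenerationRecord, and the evaluator name "rouge" is purely illustrative. It shows how get_evaluation_record and the generation_record property round-trip between the two record types.

# A hedged sketch: generation is assumed to be an existing GenerationRecord.
evaluation = generation.get_evaluation_record("rouge")
print(evaluation.label)              # dataset | task | generator | model | rouge
print(evaluation.generation_record)  # the same record without the evaluator name

# Frozen records hash and compare by value, so they can key result caches.
results: dict[EvaluationRecord, ResultRecord] = {evaluation: ResultRecord()}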

Evaluator dataclass

A class for LLM output evaluators.

Attributes:

Name Type Description
model_name str

Retrieves the model name associated with the evaluator config.

Source code in evalsense/evaluation/evaluator.py
@dataclass
class Evaluator:
    """A class for LLM output evaluators."""

    name: str
    scorer: Scorer | ScorerFactory
    model_config: ModelConfig | None = None
    cleanup_fun: Callable[[], None] | None = None

    @property
    def model_name(self) -> str:
        """Retrieves the model name associated with the evaluator config.

        Returns an empty string if the evaluator doesn't use a model config.

        Returns:
            str: The name of the model in the config or empty string.
        """
        if self.model_config is None:
            return ""
        return self.model_config.name

model_name property

model_name: str

Retrieves the model name associated with the evaluator config.

Returns an empty string if the evaluator doesn't use a model config.

Returns:

Name Type Description
str str

The name of the model in the config or empty string.
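
Example: a minimal sketch of constructing an evaluator around inspect_ai's built-in match() scorer; the import path for Evaluator and the evaluator name are assumptions.

from inspect_ai.scorer import match

from evalsense.evaluation import Evaluator

# No judge model is needed for string matching, so model_config stays None.
match_evaluator = Evaluator(name="answer_match", scorer=match())
assert match_evaluator.model_name == ""  # no model config -> empty string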

ExperimentBatchConfig dataclass

Configuration for a batch of experiments to be executed by a pipeline.

Methods:

Name Description
validate

Validates the experiment configuration.

Attributes:

Name Type Description
all_experiments list[ExperimentConfig]

Generates a list of all experiments in the batch.

Source code in evalsense/evaluation/experiment.py
@dataclass
class ExperimentBatchConfig:
    """Configuration for a batch of experiments to be executed by a pipeline."""

    tasks: list[TaskConfig]
    model_configs: list[ModelConfig]
    evaluators: list[Evaluator] = field(default_factory=list)
    name: str | None = None

    def validate(self) -> None:
        """Validates the experiment configuration.

        Raises:
            ValueError: If the configuration is invalid.
        """
        if not self.tasks:
            raise ValueError("Experiment must have at least one task.")
        if not self.model_configs:
            raise ValueError("Experiment must have at least one LLM manager.")

    @property
    def all_experiments(self) -> list[ExperimentConfig]:
        """Generates a list of all experiments in the batch.

        Returns:
            list[ExperimentConfig]: A list of all experiments in the batch.
        """
        experiments = []
        for task in self.tasks:
            for llm_manager in self.model_configs:
                if self.evaluators:
                    for evaluator in self.evaluators:
                        experiments.append(
                            ExperimentConfig(
                                dataset_manager=task.dataset_manager,
                                generation_steps=task.generation_steps,
                                field_spec=task.field_spec,
                                task_preprocessor=task.task_preprocessor,
                                model_config=llm_manager,
                                evaluator=evaluator,
                                name=self.name,
                            )
                        )
                else:
                    experiments.append(
                        ExperimentConfig(
                            dataset_manager=task.dataset_manager,
                            generation_steps=task.generation_steps,
                            field_spec=task.field_spec,
                            task_preprocessor=task.task_preprocessor,
                            model_config=llm_manager,
                            name=self.name,
                        )
                    )
        return experiments

all_experiments property

all_experiments: list[ExperimentConfig]

Generates a list of all experiments in the batch.

Returns:

Type Description
list[ExperimentConfig]

list[ExperimentConfig]: A list of all experiments in the batch.

validate

validate() -> None

Validates the experiment configuration.

Raises:

Type Description
ValueError

If the configuration is invalid.

Source code in evalsense/evaluation/experiment.py
def validate(self) -> None:
    """Validates the experiment configuration.

    Raises:
        ValueError: If the configuration is invalid.
    """
    if not self.tasks:
        raise ValueError("Experiment must have at least one task.")
    if not self.model_configs:
        raise ValueError("Experiment must have at least one LLM manager.")

ExperimentConfig dataclass

Configuration for an experiment to be executed by a pipeline.

Attributes:

Name Type Description
evaluation_record EvaluationRecord

A record identifying evaluations for a specific task.

generation_record GenerationRecord

A record identifying generations for a specific task.

Source code in evalsense/evaluation/experiment.py
@dataclass
class ExperimentConfig:
    """Configuration for an experiment to be executed by a pipeline."""

    dataset_manager: DatasetManager
    generation_steps: GenerationSteps
    model_config: ModelConfig
    field_spec: FieldSpec | RecordToSample | None = None
    task_preprocessor: TaskPreprocessor = field(
        default_factory=lambda: DefaultTaskPreprocessor()
    )
    evaluator: Evaluator | None = None
    name: str | None = None

    @property
    def generation_record(self) -> GenerationRecord:
        """A identifying generations for a specific task.

        Returns:
            GenerationsRecord: A record of the generations for the experiment.
        """
        return GenerationRecord(
            dataset_record=self.dataset_manager.record,
            generator_name=self.generation_steps.name,
            task_name=self.task_preprocessor.name,
            model_record=self.model_config.record,
            experiment_name=self.name,
        )

    @property
    def evaluation_record(self) -> EvaluationRecord:
        """A identifying evaluations for a specific task.

        Returns:
            EvaluationRecord: A record of the evaluations for the experiment.
        """
        if self.evaluator is None:
            raise ValueError(
                "Cannot get evaluation record for an experiment without an evaluator"
            )

        return self.generation_record.get_evaluation_record(
            self.evaluator.name,
        )

evaluation_record property

evaluation_record: EvaluationRecord

A record identifying evaluations for a specific task.

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

A record of the evaluations for the experiment.

generation_record property

generation_record: GenerationRecord

A record identifying generations for a specific task.

Returns:

Name Type Description
GenerationRecord GenerationRecord

A record of the generations for the experiment.
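
Example: a minimal sketch of a single experiment and its records; dataset_manager, generation_steps, model_config and match_evaluator are assumed to be configured elsewhere in evalsense.

config = ExperimentConfig(
    dataset_manager=dataset_manager,
    generation_steps=generation_steps,
    model_config=model_config,
    evaluator=match_evaluator,
    name="baseline",
)
generation_record = config.generation_record  # identifies the generation run
evaluation_record = config.evaluation_record  # adds the evaluator name; raises ValueError without an evaluator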

GenerationRecord

Bases: BaseModel

A record identifying generations for a specific task.

Attributes:

Name Type Description
dataset_record DatasetRecord

The record of the dataset.

generator_name str

The name of the generator.

task_name str

The name of the task.

model_record ModelRecord

The record of the model.

experiment_name str | None

The name of the experiment, if applicable.

Methods:

Name Description
__eq__

Checks equality with another record.

__hash__

Generates a hash for the generation record.

__lt__

Checks if this record is less than another record.

get_evaluation_record

Generates an evaluation record from the generation record.

Source code in evalsense/evaluation/experiment.py
@total_ordering
class GenerationRecord(BaseModel, frozen=True):
    """A record identifying generations for a specific task.

    Attributes:
        dataset_record (DatasetRecord): The record of the dataset.
        generator_name (str): The name of the generator.
        task_name (str): The name of the task.
        model_record (ModelRecord): The record of the model.
        experiment_name (str | None): The name of the experiment, if applicable.
    """

    dataset_record: DatasetRecord
    generator_name: str
    task_name: str
    model_record: ModelRecord
    experiment_name: str | None = None

    def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
        """Generates an evaluation record from the generation record.

        Args:
            evaluator_name (str): The name of the evaluator.

        Returns:
            EvaluationRecord: The evaluation record.
        """
        return EvaluationRecord(
            **self.model_dump(),
            evaluator_name=evaluator_name,
        )

    @property
    def label(self) -> str:
        """Generates a label for the generation record.

        Returns:
            str: The label for the generation record.
        """
        return (
            f"{self.dataset_record.name} | {self.task_name} | "
            f"{self.generator_name} | {self.model_record.name}"
        )

    def __eq__(self, other: object) -> bool:
        """Checks equality with another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if the records are equal, False otherwise.
        """
        if not isinstance(other, GenerationRecord) or type(self) is not type(other):
            return NotImplemented
        return (
            self.dataset_record == other.dataset_record
            and self.generator_name == other.generator_name
            and self.task_name == other.task_name
            and self.model_record == other.model_record
            and self.experiment_name == other.experiment_name
        )

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, GenerationRecord) or type(self) is not type(other):
            return NotImplemented
        return (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name or "",
        ) < (
            other.dataset_record,
            other.generator_name,
            other.task_name,
            other.model_record,
            other.experiment_name or "",
        )

    def __hash__(self) -> int:
        """Generates a hash for the generation record.

        Returns:
            int: The hash of the generation record.
        """
        return hash(
            (
                self.dataset_record,
                self.generator_name,
                self.task_name,
                self.model_record,
                self.experiment_name,
            )
        )

label property

label: str

Generates a label for the generation record.

Returns:

Name Type Description
str str

The label for the generation record.

__eq__

__eq__(other: object) -> bool

Checks equality with another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if the records are equal, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __eq__(self, other: object) -> bool:
    """Checks equality with another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if the records are equal, False otherwise.
    """
    if not isinstance(other, GenerationRecord) or type(self) is not type(other):
        return NotImplemented
    return (
        self.dataset_record == other.dataset_record
        and self.generator_name == other.generator_name
        and self.task_name == other.task_name
        and self.model_record == other.model_record
        and self.experiment_name == other.experiment_name
    )

__hash__

__hash__() -> int

Generates a hash for the generation record.

Returns:

Name Type Description
int int

The hash of the generation record.

Source code in evalsense/evaluation/experiment.py
def __hash__(self) -> int:
    """Generates a hash for the generation record.

    Returns:
        int: The hash of the generation record.
    """
    return hash(
        (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name,
        )
    )

__lt__

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if this record is less than the other, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, GenerationRecord) or type(self) is not type(other):
        return NotImplemented
    return (
        self.dataset_record,
        self.generator_name,
        self.task_name,
        self.model_record,
        self.experiment_name or "",
    ) < (
        other.dataset_record,
        other.generator_name,
        other.task_name,
        other.model_record,
        other.experiment_name or "",
    )

get_evaluation_record

get_evaluation_record(
    evaluator_name: str,
) -> EvaluationRecord

Generates an evaluation record from the generation record.

Parameters:

Name Type Description Default
evaluator_name str

The name of the evaluator.

required

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

The evaluation record.

Source code in evalsense/evaluation/experiment.py
def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
    """Generates an evaluation record from the generation record.

    Args:
        evaluator_name (str): The name of the evaluator.

    Returns:
        EvaluationRecord: The evaluation record.
    """
    return EvaluationRecord(
        **self.model_dump(),
        evaluator_name=evaluator_name,
    )
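
Example: a minimal sketch showing how generation records are built and ordered; dataset_manager and model_config are assumed to be an evalsense DatasetManager and ModelConfig configured elsewhere, and the generator and task names are illustrative.

records = [
    GenerationRecord(
        dataset_record=dataset_manager.record,
        generator_name=generator_name,
        task_name="summarisation",
        model_record=model_config.record,
        experiment_name="baseline",
    )
    for generator_name in ("zero_shot", "few_shot")
]

# Records are frozen, hashable and totally ordered, so they sort deterministically
# and can serve as dictionary keys.
for record in sorted(records):
    print(record.label)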

PerturbationGroupedRecord

Bases: EvaluationRecord

A record grouping evaluation records by generator name (as generator name specifies the perturbation tier).

Attributes:

Name Type Description
dataset_record DatasetRecord

The record of the dataset.

generator_name str

The name of the generator.

task_name str

The name of the task.

model_record ModelRecord

The record of the model.

experiment_name str | None

The name of the experiment, if applicable.

evaluator_name str

The name of the evaluator.

metric_name str

The name of the metric being evaluated.

Methods:

Name Description
__eq__

Checks equality with another record.

__hash__

Generates a hash for the perturbation grouped record.

__lt__

Checks if this record is less than another record.

get_evaluation_record

Generates an evaluation record from the generation record.

get_perturbation_grouped_record

Generates a perturbation grouped record from the evaluation record.

Source code in evalsense/evaluation/experiment.py
@total_ordering
class PerturbationGroupedRecord(EvaluationRecord, frozen=True):
    """A record grouping evaluation records by generator name
    (as generator name specifies the perturbation tier).

    Attributes:
        dataset_record (DatasetRecord): The record of the dataset.
        generator_name (str): The name of the generator.
        task_name (str): The name of the task.
        model_record (ModelRecord): The record of the model.
        experiment_name (str | None): The name of the experiment, if applicable.
        evaluator_name (str): The name of the evaluator.
        metric_name (str): The name of the metric being evaluated.
    """

    metric_name: str

    def __eq__(self, other: object) -> bool:
        """Checks equality with another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if the records are equal, False otherwise.
        """
        if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
            other
        ):
            return NotImplemented
        generation_equal = super().__eq__(other)
        if generation_equal is NotImplemented:
            return generation_equal
        return generation_equal and self.metric_name == other.metric_name

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            bool: True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
            other
        ):
            return NotImplemented
        if super().__lt__(other):
            return True
        elif super().__eq__(other):
            return self.metric_name < other.metric_name
        else:
            return False

    def __hash__(self) -> int:
        """Generates a hash for the perturbation grouped record.

        Returns:
            int: The hash of the perturbation grouped record.
        """
        return hash(
            (
                self.dataset_record,
                self.generator_name,
                self.task_name,
                self.model_record,
                self.experiment_name,
                self.evaluator_name,
                self.metric_name,
            )
        )

generation_record property

generation_record: GenerationRecord

Generates a generation record from the evaluation record.

Returns:

Name Type Description
GenerationRecord GenerationRecord

The generation record.

label property

label: str

Generates a label for the evaluation record.

Returns:

Name Type Description
str str

The label for the evaluation record.

__eq__

__eq__(other: object) -> bool

Checks equality with another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if the records are equal, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __eq__(self, other: object) -> bool:
    """Checks equality with another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if the records are equal, False otherwise.
    """
    if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
        other
    ):
        return NotImplemented
    generation_equal = super().__eq__(other)
    if generation_equal is NotImplemented:
        return generation_equal
    return generation_equal and self.metric_name == other.metric_name

__hash__

__hash__() -> int

Generates a hash for the perturbation grouped record.

Returns:

Name Type Description
int int

The hash of the perturbation grouped record.

Source code in evalsense/evaluation/experiment.py
def __hash__(self) -> int:
    """Generates a hash for the perturbation grouped record.

    Returns:
        int: The hash of the perturbation grouped record.
    """
    return hash(
        (
            self.dataset_record,
            self.generator_name,
            self.task_name,
            self.model_record,
            self.experiment_name,
            self.evaluator_name,
            self.metric_name,
        )
    )

__lt__

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name Type Description Default
other object

The other record to compare with.

required

Returns:

Name Type Description
bool bool

True if this record is less than the other, False otherwise.

Source code in evalsense/evaluation/experiment.py
def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        bool: True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, PerturbationGroupedRecord) or type(self) is not type(
        other
    ):
        return NotImplemented
    if super().__lt__(other):
        return True
    elif super().__eq__(other):
        return self.metric_name < other.metric_name
    else:
        return False

get_evaluation_record

get_evaluation_record(
    evaluator_name: str,
) -> EvaluationRecord

Generates an evaluation record from the generation record.

Parameters:

Name Type Description Default
evaluator_name str

The name of the evaluator.

required

Returns:

Name Type Description
EvaluationRecord EvaluationRecord

The evaluation record.

Source code in evalsense/evaluation/experiment.py
def get_evaluation_record(self, evaluator_name: str) -> "EvaluationRecord":
    """Generates an evaluation record from the generation record.

    Args:
        evaluator_name (str): The name of the evaluator.

    Returns:
        EvaluationRecord: The evaluation record.
    """
    return EvaluationRecord(
        **self.model_dump(),
        evaluator_name=evaluator_name,
    )

get_perturbation_grouped_record

get_perturbation_grouped_record(
    metric_name: str,
) -> PerturbationGroupedRecord

Generates a perturbation grouped record from the evaluation record.

Parameters:

Name Type Description Default
metric_name str

The name of the metric being evaluated.

required

Returns:

Name Type Description
PerturbationGroupedRecord PerturbationGroupedRecord

The perturbation grouped record.

Source code in evalsense/evaluation/experiment.py
def get_perturbation_grouped_record(
    self, metric_name: str
) -> "PerturbationGroupedRecord":
    """Generates a perturbation grouped record from the evaluation record.

    Args:
        metric_name (str): The name of the metric being evaluated.

    Returns:
        PerturbationGroupedRecord: The perturbation grouped record.
    """
    return PerturbationGroupedRecord(
        **self.model_dump(exclude={"generator_name"}),
        generator_name="",
        metric_name=metric_name,
    )
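
Example: a minimal sketch of grouping; eval_a and eval_b are assumed to be EvaluationRecords that differ only in generator_name, i.e. only in the applied perturbation tier, and the metric name and scores are illustrative.

grouped_a = eval_a.get_perturbation_grouped_record("rouge_l")
grouped_b = eval_b.get_perturbation_grouped_record("rouge_l")

# generator_name is reset to "" in the grouped record, so both evaluations
# collapse to the same grouping key.
assert grouped_a == grouped_b
scores_by_group: dict[PerturbationGroupedRecord, list[float]] = {grouped_a: [0.41, 0.37]}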

ResultRecord

Bases: BaseModel

A record indicating the result of generation or evaluation.

Attributes:

Name Type Description
status RecordStatus

The status of the record.

error_message str | None

The error message, if any.

log_location str | None

The location of the associated Inspect log file.

Source code in evalsense/evaluation/experiment.py
class ResultRecord(BaseModel, frozen=True):
    """A record indicating the result of generation or evaluation.

    Attributes:
        status (RecordStatus): The status of the record.
        error_message (str | None): The error message, if any.
        log_location (str | None): The location of the associated Inspect log file.
    """

    status: RecordStatus = "started"
    error_message: str | None = None
    log_location: str | None = None
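
Example: a minimal sketch using the default "started" status; the log path is hypothetical.

record = ResultRecord(log_location="logs/summarisation_zero_shot.eval")
if record.error_message is not None:
    print(f"Run failed: {record.error_message}")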

ScoreCalculator

Bases: Protocol

A protocol for computing evaluation scores.

Methods:

Name Description
calculate

Computes evaluation scores for the given evaluation method.

calculate_async

Asynchronously computes evaluation scores for the given evaluation method.

Source code in evalsense/evaluation/evaluator.py
@runtime_checkable
class ScoreCalculator(Protocol):
    """A protocol for computing evaluation scores."""

    @abstractmethod
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Computes evaluation scores for the given evaluation method

        Args:
            prediction (str): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.
            **kwargs (dict): Additional keyword arguments specific to the given
                evaluation method.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        pass

    @abstractmethod
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Asynchronously computes evaluation scores for the given evaluation method

        Args:
            prediction (str): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.
            **kwargs (dict): Additional keyword arguments specific to the given
                evaluation method.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        pass

calculate abstractmethod

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Computes evaluation scores for the given evaluation method.

Parameters:

Name Type Description Default
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None
**kwargs dict

Additional keyword arguments specific to the given evaluation method.

{}

Returns:

Name Type Description
Score Score

The Inspect AI Score object with the calculated result.

Source code in evalsense/evaluation/evaluator.py
@abstractmethod
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Computes evaluation scores for the given evaluation method

    Args:
        prediction (str): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.
        **kwargs (dict): Additional keyword arguments specific to the given
            evaluation method.

    Returns:
        Score: The Inspect AI Score object with the calculated result.
    """
    pass

calculate_async abstractmethod async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Asynchronously computes evaluation scores for the given evaluation method.

Parameters:

Name Type Description Default
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None
**kwargs dict

Additional keyword arguments specific to the given evaluation method.

{}

Returns:

Name Type Description
Score Score

The Inspect AI Score object with the calculated result.

Source code in evalsense/evaluation/evaluator.py
@abstractmethod
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Asynchronously computes evaluation scores for the given evaluation method

    Args:
        prediction (str): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.
        **kwargs (dict): Additional keyword arguments specific to the given
            evaluation method.

    Returns:
        Score: The Inspect AI Score object with the calculated result.
    """
    pass
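
Example: a hedged sketch of implementing the protocol with a simple exact-match calculator; it assumes inspect_ai's Score accepts value and answer arguments, and the async variant simply delegates to the synchronous one.

from typing import Any

from inspect_ai.scorer import Score


class ExactMatchCalculator:
    """Scores a prediction as 1.0 when it exactly matches the reference."""

    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        matched = reference is not None and prediction.strip() == reference.strip()
        return Score(value=1.0 if matched else 0.0, answer=prediction)

    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        # Exact matching is cheap, so the async variant just delegates.
        return self.calculate(
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
            **kwargs,
        )


# runtime_checkable protocols check method presence (not signatures).
assert isinstance(ExactMatchCalculator(), ScoreCalculator)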

ScorerFactory

Bases: Protocol

A protocol for constructing a Scorer given a Model.

Methods:

Name Description
create_scorer

Creates a Scorer from a Model.

Source code in evalsense/evaluation/evaluator.py
@runtime_checkable
class ScorerFactory(Protocol):
    """A protocol for constructing a Scorer given a Model."""

    @abstractmethod
    def create_scorer(self, model: Model) -> Scorer:
        """Creates a Scorer from a Model.

        Args:
            model (Model): The model to create a scorer for.

        Returns:
            Scorer: The created scorer.
        """
        pass

create_scorer abstractmethod

create_scorer(model: Model) -> Scorer

Creates a Scorer from a Model.

Parameters:

Name Type Description Default
model Model

The model to create a scorer for.

required

Returns:

Name Type Description
Scorer Scorer

The created scorer.

Source code in evalsense/evaluation/evaluator.py
@abstractmethod
def create_scorer(self, model: Model) -> Scorer:
    """Creates a Scorer from a Model.

    Args:
        model (Model): The model to create a scorer for.

    Returns:
        Scorer: The created scorer.
    """
    pass
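
Example: a hedged sketch of a factory that builds a model-graded scorer; the scorer, accuracy, Target and TaskState imports and the model.generate call follow the usual inspect_ai API, and the grading prompt is purely illustrative.

from inspect_ai.model import Model
from inspect_ai.scorer import Score, Scorer, Target, accuracy, scorer
from inspect_ai.solver import TaskState


class JudgeScorerFactory:
    """Builds a scorer that asks the supplied model to grade each output."""

    def create_scorer(self, model: Model) -> Scorer:
        @scorer(metrics=[accuracy()])
        def judge() -> Scorer:
            async def score(state: TaskState, target: Target) -> Score:
                # Ask the judge model whether the generated output matches the target.
                verdict = await model.generate(
                    "Reply with GRADE: C if the answer matches the target, "
                    "or GRADE: I otherwise.\n"
                    f"Answer: {state.output.completion}\nTarget: {target.text}"
                )
                return Score(value=1.0 if "GRADE: C" in verdict.completion else 0.0)

            return score

        return judge()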

TaskConfig dataclass

Configuration for a task to be executed by a pipeline.

Source code in evalsense/evaluation/experiment.py
@dataclass
class TaskConfig:
    """Configuration for a task to be executed by a pipeline."""

    dataset_manager: DatasetManager
    generation_steps: GenerationSteps
    field_spec: FieldSpec | RecordToSample | None = None
    task_preprocessor: TaskPreprocessor = field(
        default_factory=lambda: DefaultTaskPreprocessor()
    )
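
Example: a minimal sketch; the DatasetManager and GenerationSteps instances (and the model config used for the batch) are assumed to be configured elsewhere in evalsense.

task = TaskConfig(
    dataset_manager=dataset_manager,
    generation_steps=generation_steps,
)

# A task is typically combined with model configs and evaluators in an
# ExperimentBatchConfig, as shown above.
batch = ExperimentBatchConfig(tasks=[task], model_configs=[model_config])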