Evaluators

Modules:

Name Description
bertscore
bleu
g_eval
qags
rouge

Classes:

Name Description
BertScoreCalculator

Calculator for computing BERTScores.

BleuPrecisionScoreCalculator

Calculator for computing BLEU scores.

GEvalScoreCalculator

G-Eval score calculator.

GEvalScorerFactory

Scorer factory for G-Eval.

QagsConfig

A protocol for configuring QAGS evaluation.

QagsScoreCalculator

QAGS score calculator.

QagsScorerFactory

Scorer factory for QAGS.

RougeScoreCalculator

Calculator for computing ROUGE scores.

Functions:

Name Description
bleu_metric

Base metric for BLEU scores.

get_bertscore_evaluator

Returns a BERTScore evaluator.

get_bleu_evaluator

Returns an evaluator for BLEU scores.

get_g_eval_evaluator

Constructs a G-Eval evaluator that can be used in the EvalSense evaluation pipeline.

get_qags_evaluator

Constructs a QAGS evaluator that can be used in the EvalSense evaluation pipeline.

get_rouge_evaluator

Returns an evaluator for ROUGE scores.

BertScoreCalculator

Bases: ScoreCalculator

Calculator for computing BERTScores.

Methods:

Name Description
__init__

Initializes the BERTScore calculator.

calculate

Calculates BERTScore for the supplied model prediction and reference input.

calculate_async

Calculates BERTScore for the supplied model prediction and reference input.

Source code in evalsense/evaluation/evaluators/bertscore.py
class BertScoreCalculator(ScoreCalculator):
    """Calculator for computing BERTScores."""

    def __init__(
        self,
        model_type: str = "microsoft/deberta-xlarge-mnli",
        lang: str = "en",
        num_layers: int | None = None,
        idf: bool | dict[str, float] = False,
    ):
        """
        Initializes the BERTScore calculator.

        Args:
            model_type (str, optional): The model type to use for computing BERTScore.
                Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing
                model according to the BERTScore authors.
            lang (str, optional): The language of the text. Defaults to "en".
            num_layers (int, optional): The layer of representations to use.
            idf (bool | dict[str, float], optional): Use IDF weighting — can be a precomputed IDF dictionary.
        """
        self.bertscore_module = evaluate.load("bertscore")
        self.model_type = model_type
        self.lang = lang
        self.num_layers = num_layers
        self.idf = idf

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        verbose=False,
        device=None,
        batch_size=64,
        nthreads=1,
        rescale_with_baseline=False,
        baseline_path=None,
        use_fast_tokenizer=False,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BERTScore for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the model input. Ignored for BERTScore.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Metadata for the evaluation. Ignored for BERTScore.
            verbose (bool): Whether to turn on verbose mode.
            device (str, optional): The device to use for computing the contextual embeddings.
            batch_size (int): The batch size to use for computing the contextual embeddings.
            nthreads (int): The number of threads to use for computing the contextual embeddings.
            rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
            baseline_path (str, optional): Customized baseline file.
            use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        if reference is None:
            raise ValueError(
                "Reference is required for computing BERTScore, but was None."
            )

        predictions = [prediction]
        references = [reference]

        result = self.bertscore_module.compute(
            predictions=predictions,
            references=references,
            lang=self.lang,
            model_type=self.model_type,
            num_layers=self.num_layers,
            verbose=verbose,
            idf=self.idf,
            device=device,
            batch_size=batch_size,
            nthreads=nthreads,
            rescale_with_baseline=rescale_with_baseline,
            baseline_path=baseline_path,
            use_fast_tokenizer=use_fast_tokenizer,
        )
        return Score(
            value={
                "BERTScore Precision": result["precision"][0],  # type: ignore
                "BERTScore Recall": result["recall"][0],  # type: ignore
                "BERTScore F1": result["f1"][0],  # type: ignore
            },
            answer=prediction,
            metadata={
                "hashcode": result["hashcode"],  # type: ignore
            },
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        verbose=False,
        device=None,
        batch_size=64,
        nthreads=1,
        rescale_with_baseline=False,
        baseline_path=None,
        use_fast_tokenizer=False,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BERTScore for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str | None): The text of the model input. Ignored for BERTScore.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any] | None): Metadata for the evaluation. Ignored for BERTScore.
            verbose (bool): Whether to turn on verbose mode.
            device (str | None): The device to use for computing the contextual embeddings.
            batch_size (int): The batch size to use for computing the contextual embeddings.
            nthreads (int): The number of threads to use for computing the contextual embeddings.
            rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
            baseline_path (str | None): Customized baseline file.
            use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        return self.calculate(
            prediction=prediction,
            reference=reference,
            verbose=verbose,
            device=device,
            batch_size=batch_size,
            nthreads=nthreads,
            rescale_with_baseline=rescale_with_baseline,
            baseline_path=baseline_path,
            use_fast_tokenizer=use_fast_tokenizer,
        )
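
A minimal usage sketch for the calculator above. The import path and the example texts are assumptions, and the Hugging Face evaluate package (plus the underlying bert_score dependency) must be installed; the selected model is downloaded on first use.

from evalsense.evaluation.evaluators import BertScoreCalculator  # import path assumed

# Instantiating the calculator loads the Hugging Face "bertscore" metric.
calculator = BertScoreCalculator(model_type="microsoft/deberta-xlarge-mnli", lang="en")

score = calculator.calculate(
    prediction="The patient was discharged after three days.",
    reference="The patient left the hospital after a three-day stay.",
    batch_size=16,
)

# score.value is a dict with "BERTScore Precision", "BERTScore Recall" and "BERTScore F1".
print(score.value)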

__init__

__init__(
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    idf: bool | dict[str, float] = False,
)

Initializes the BERTScore calculator.

Parameters:

Name Type Description Default
model_type str

The model type to use for computing BERTScore. Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing model according to the BERTScore authors.

'microsoft/deberta-xlarge-mnli'
lang str

The language of the text. Defaults to "en".

'en'
num_layers int

The layer of representations to use.

None
idf bool | dict[str, float]

Use IDF weighting — can be a precomputed IDF dictionary.

False
Source code in evalsense/evaluation/evaluators/bertscore.py
def __init__(
    self,
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    idf: bool | dict[str, float] = False,
):
    """
    Initializes the BERTScore calculator.

    Args:
        model_type (str, optional): The model type to use for computing BERTScore.
            Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing
            model according to the BERTScore authors.
        lang (str, optional): The language of the text. Defaults to "en".
        num_layers (int, optional): The layer of representations to use.
        idf (bool | dict[str, float], optional): Use IDF weighting — can be a precomputed IDF dictionary.
    """
    self.bertscore_module = evaluate.load("bertscore")
    self.model_type = model_type
    self.lang = lang
    self.num_layers = num_layers
    self.idf = idf

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score

Calculates BERTScore for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the model input. Ignored for BERTScore.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Metadata for the evaluation. Ignored for BERTScore.

None
verbose bool

Whether to turn on verbose mode.

False
device str

The device to use for computing the contextual embeddings.

None
batch_size int

The batch size to use for computing the contextual embeddings.

64
nthreads int

The number of threads to use for computing the contextual embeddings.

1
rescale_with_baseline bool

Whether to rescale the BERTScore with pre-computed baseline.

False
baseline_path str

Customized baseline file.

None
use_fast_tokenizer bool

The use_fast parameter passed to HF tokenizer.

False

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bertscore.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score:
    """
    Calculates BERTScore for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the model input. Ignored for BERTScore.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Metadata for the evaluation. Ignored for BERTScore.
        verbose (bool): Whether to turn on verbose mode.
        device (str, optional): The device to use for computing the contextual embeddings.
        batch_size (int): The batch size to use for computing the contextual embeddings.
        nthreads (int): The number of threads to use for computing the contextual embeddings.
        rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
        baseline_path (str, optional): Customized baseline file.
        use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    if reference is None:
        raise ValueError(
            "Reference is required for computing BERTScore, but was None."
        )

    predictions = [prediction]
    references = [reference]

    result = self.bertscore_module.compute(
        predictions=predictions,
        references=references,
        lang=self.lang,
        model_type=self.model_type,
        num_layers=self.num_layers,
        verbose=verbose,
        idf=self.idf,
        device=device,
        batch_size=batch_size,
        nthreads=nthreads,
        rescale_with_baseline=rescale_with_baseline,
        baseline_path=baseline_path,
        use_fast_tokenizer=use_fast_tokenizer,
    )
    return Score(
        value={
            "BERTScore Precision": result["precision"][0],  # type: ignore
            "BERTScore Recall": result["recall"][0],  # type: ignore
            "BERTScore F1": result["f1"][0],  # type: ignore
        },
        answer=prediction,
        metadata={
            "hashcode": result["hashcode"],  # type: ignore
        },
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score

Calculates BERTScore for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str | None

The text of the model input. Ignored for BERTScore.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any] | None

Metadata for the evaluation. Ignored for BERTScore.

None
verbose bool

Whether to turn on verbose mode.

False
device str | None

The device to use for computing the contextual embeddings.

None
batch_size int

The batch size to use for computing the contextual embeddings.

64
nthreads int

The number of threads to use for computing the contextual embeddings.

1
rescale_with_baseline bool

Whether to rescale the BERTScore with pre-computed baseline.

False
baseline_path str | None

Customized baseline file.

None
use_fast_tokenizer bool

The use_fast parameter passed to HF tokenizer.

False

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bertscore.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score:
    """
    Calculates BERTScore for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str | None): The text of the model input. Ignored for BERTScore.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any] | None): Metadata for the evaluation. Ignored for BERTScore.
        verbose (bool): Whether to turn on verbose mode.
        device (str | None): The device to use for computing the contextual embeddings.
        batch_size (int): The batch size to use for computing the contextual embeddings.
        nthreads (int): The number of threads to use for computing the contextual embeddings.
        rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
        baseline_path (str | None): Customized baseline file.
        use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    return self.calculate(
        prediction=prediction,
        reference=reference,
        verbose=verbose,
        device=device,
        batch_size=batch_size,
        nthreads=nthreads,
        rescale_with_baseline=rescale_with_baseline,
        baseline_path=baseline_path,
        use_fast_tokenizer=use_fast_tokenizer,
    )

BleuPrecisionScoreCalculator

Bases: ScoreCalculator

Calculator for computing BLEU scores.

Methods:

Name Description
calculate

Calculates BLEU precision scores for the supplied model prediction and reference input.

calculate_async

Calculates BLEU precision scores for the supplied model prediction and reference input.

Source code in evalsense/evaluation/evaluators/bleu.py
class BleuPrecisionScoreCalculator(ScoreCalculator):
    """Calculator for computing BLEU scores."""

    def __init__(self):
        self.bleu_module = evaluate.load("bleu")

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BLEU precision scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for BLEU.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for BLEU.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        if reference is None:
            raise ValueError(
                "Reference is required for computing BLEU precision, but was None."
            )

        predictions = [prediction]
        references = [reference]

        result = self.bleu_module.compute(
            predictions=predictions, references=references
        )
        return Score(
            value=result["precisions"][0],  # type: ignore
            answer=prediction,
            metadata={
                "prediction": prediction,
                "reference": reference,
            },
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BLEU precision scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for BLEU.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for BLEU.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        return self.calculate(
            prediction=prediction,
            reference=reference,
            input=input,
            metadata=metadata,
            **kwargs,
        )
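
A short usage sketch, again assuming the import path; the sentences are purely illustrative.

from evalsense.evaluation.evaluators import BleuPrecisionScoreCalculator  # import path assumed

calculator = BleuPrecisionScoreCalculator()

score = calculator.calculate(
    prediction="the cat sat on the mat",
    reference="the cat is sitting on the mat",
)

# score.value holds the unigram precision reported by the Hugging Face "bleu" metric.
print(score.value)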

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates BLEU precision scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for BLEU.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for BLEU.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bleu.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates BLEU precision scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for BLEU.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for BLEU.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    if reference is None:
        raise ValueError(
            "Reference is required for computing BLEU precision, but was None."
        )

    predictions = [prediction]
    references = [reference]

    result = self.bleu_module.compute(
        predictions=predictions, references=references
    )
    return Score(
        value=result["precisions"][0],  # type: ignore
        answer=prediction,
        metadata={
            "prediction": prediction,
            "reference": reference,
        },
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates BLEU precision scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for BLEU.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for BLEU.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bleu.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates BLEU precision scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for BLEU.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for BLEU.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    return self.calculate(
        prediction=prediction,
        reference=reference,
        input=input,
        metadata=metadata,
        **kwargs,
    )

GEvalScoreCalculator

Bases: ScoreCalculator

G-Eval score calculator.

Methods:

Name Description
__init__

Initializes the G-Eval score calculator.

calculate

This method is not supported for G-Eval and will raise an error when called.

calculate_async

Calculates the G-Eval score asynchronously.

Source code in evalsense/evaluation/evaluators/g_eval.py
class GEvalScoreCalculator(ScoreCalculator):
    """G-Eval score calculator."""

    def __init__(
        self,
        model: Model,
        prompt_template: str,
        logprobs: bool = True,
        top_logprobs: int = 20,
        min_score: int = 1,
        max_score: int = 10,
        normalise: bool = True,
        debug: bool = False,
    ):
        """
        Initializes the G-Eval score calculator.

        Args:
            model (Model): The model to use for evaluation.
            prompt_template (str): The prompt template with the scoring instructions.
            logprobs (bool): Whether to use model log probabilities to compute weighted
                evaluation score instead of a standard score.
            top_logprobs (int): The number of top log probabilities to consider.
            min_score (int): The minimum valid score.
            max_score (int): The maximum valid score.
            normalise (bool): Whether to normalise the scores between 0 and 1.
            debug (bool): Whether to report repeated errors in the log.
        """
        self.model = model
        self.prompt_template = prompt_template
        self.logprobs = logprobs
        self.top_logprobs = top_logprobs
        self.min_score = min_score
        self.max_score = max_score
        self.normalise = normalise
        self.debug = debug
        self.warned_weighted_score = False

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """This method is not supported for G-Eval and will raise an error when called.

        Use `calculate_async` instead.

        Raises:
            NotImplementedError: When called, as synchronous evaluation is not
                supported for G-Eval.
        """
        raise NotImplementedError(
            "Synchronous evaluation is not supported for G-Eval. "
            "Use calculate_async instead."
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Calculates the G-Eval score asynchronously.

        Args:
            prediction (str): The predicted output to evaluate.
            input (str | None): The input text for the model. Defaults to `None`.
            reference (str | None): The reference text for the model. Defaults to `None`.
            metadata (dict[str, Any] | None): Additional metadata for the evaluation.
                Defaults to `None`.
            **kwargs: Additional keyword arguments.

        Returns:
            Score: The calculated score.
        """
        logprobs_config = GenerateConfig(
            logprobs=self.logprobs,
            top_logprobs=self.top_logprobs,
        )
        if metadata is None:
            metadata = {}
        llm_input = format_template(
            self.prompt_template,
            prediction=prediction,
            reference=reference,
            input=input,
            **metadata,
        )
        output = await self.model.generate(llm_input, config=logprobs_config)

        raw_score = extract_score(output.completion, self.min_score, self.max_score)
        if self.logprobs:
            try:
                raw_score = extract_weighted_score(
                    output, min_score=self.min_score, max_score=self.max_score
                )
            except ValueError as e:
                if not self.warned_weighted_score or self.debug:
                    self.warned_weighted_score = True

                    error_message = (
                        f"❌  Cannot compute weighted evaluation score: {e} "
                        "Falling back to standard score."
                    )

                    if not self.debug:
                        error_message += (
                            " Further errors will be suppressed "
                            + "(set debug=True to see all errors)."
                        )
                    error_message += f" Offending output: {output.completion}"

                    logger.error(error_message)

        if self.normalise:
            score = (raw_score - self.min_score) / (self.max_score - self.min_score)
        else:
            score = raw_score

        return Score(
            value=score,
            answer=prediction,
            metadata={
                "prompt": llm_input,
                "output_text": output.completion,
                "raw_score": raw_score,
            },
        )
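
A sketch of asynchronous scoring with the calculator above. The import paths, the model identifier and the curly-brace placeholder names expected by the prompt template are assumptions; only calculate_async is usable, since calculate raises NotImplementedError.

import asyncio

from inspect_ai.model import get_model

from evalsense.evaluation.evaluators import GEvalScoreCalculator  # import path assumed

# The placeholder names and their {curly-brace} syntax are assumptions about format_template.
PROMPT_TEMPLATE = (
    "Rate the factual consistency of the summary against the source on a scale from 1 to 10. "
    "Respond with the score only.\n\nSource: {input}\n\nSummary: {prediction}"
)


async def main() -> None:
    model = get_model("openai/gpt-4o")  # any Inspect AI model identifier
    calculator = GEvalScoreCalculator(model=model, prompt_template=PROMPT_TEMPLATE)
    score = await calculator.calculate_async(
        prediction="The trial enrolled 120 patients over six months.",
        input="A six-month clinical trial was conducted with 120 participants.",
    )
    # With normalise=True (the default) the value lies between 0 and 1; the
    # unnormalised score is kept in metadata["raw_score"].
    print(score.value, score.metadata["raw_score"])


asyncio.run(main())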

__init__

__init__(
    model: Model,
    prompt_template: str,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
)

Initializes the G-Eval score calculator.

Parameters:

Name Type Description Default
model Model

The model to use for evaluation.

required
prompt_template str

The prompt template with the scoring instructions.

required
logprobs bool

Whether to use model log probabilities to compute weighted evaluation score instead of a standard score.

True
top_logprobs int

The number of top log probabilities to consider.

20
min_score int

The minimum valid score.

1
max_score int

The maximum valid score.

10
normalise bool

Whether to normalise the scores between 0 and 1.

True
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/g_eval.py
def __init__(
    self,
    model: Model,
    prompt_template: str,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
):
    """
    Initializes the G-Eval score calculator.

    Args:
        model (Model): The model to use for evaluation.
        prompt_template (str): The prompt template with the scoring instructions.
        logprobs (bool): Whether to use model log probabilities to compute weighted
            evaluation score instead of a standard score.
        top_logprobs (int): The number of top log probabilities to consider.
        min_score (int): The minimum valid score.
        max_score (int): The maximum valid score.
        normalise (bool): Whether to normalise the scores between 0 and 1.
        debug (bool): Whether to report repeated errors in the log.
    """
    self.model = model
    self.prompt_template = prompt_template
    self.logprobs = logprobs
    self.top_logprobs = top_logprobs
    self.min_score = min_score
    self.max_score = max_score
    self.normalise = normalise
    self.debug = debug
    self.warned_weighted_score = False

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

This method is not supported for G-Eval and will raise an error when called.

Use calculate_async instead.

Raises:

Type Description
NotImplementedError

When called, as synchronous evaluation is not supported for G-Eval.

Source code in evalsense/evaluation/evaluators/g_eval.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """This method is not supported for G-Eval and will raise an error when called.

    Use `calculate_async` instead.

    Raises:
        NotImplementedError: When called, as synchronous evaluation is not
            supported for G-Eval.
    """
    raise NotImplementedError(
        "Synchronous evaluation is not supported for G-Eval. "
        "Use calculate_async instead."
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates the G-Eval score asynchronously.

Parameters:

Name Type Description Default
prediction str

The predicted output to evaluate.

required
input str | None

The input text for the model. Defaults to None.

None
reference str | None

The reference text for the model. Defaults to None.

None
metadata dict[str, Any] | None

Additional metadata for the evaluation. Defaults to None.

None
**kwargs dict

Additional keyword arguments.

{}

Returns:

Name Type Description
Score Score

The calculated score.

Source code in evalsense/evaluation/evaluators/g_eval.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Calculates the G-Eval score asynchronously.

    Args:
        prediction (str): The predicted output to evaluate.
        input (str | None): The input text for the model. Defaults to `None`.
        reference (str | None): The reference text for the model. Defaults to `None`.
        metadata (dict[str, Any] | None): Additional metadata for the evaluation.
            Defaults to `None`.
        **kwargs: Additional keyword arguments.

    Returns:
        Score: The calculated score.
    """
    logprobs_config = GenerateConfig(
        logprobs=self.logprobs,
        top_logprobs=self.top_logprobs,
    )
    if metadata is None:
        metadata = {}
    llm_input = format_template(
        self.prompt_template,
        prediction=prediction,
        reference=reference,
        input=input,
        **metadata,
    )
    output = await self.model.generate(llm_input, config=logprobs_config)

    raw_score = extract_score(output.completion, self.min_score, self.max_score)
    if self.logprobs:
        try:
            raw_score = extract_weighted_score(
                output, min_score=self.min_score, max_score=self.max_score
            )
        except ValueError as e:
            if not self.warned_weighted_score or self.debug:
                self.warned_weighted_score = True

                error_message = (
                    f"❌  Cannot compute weighted evaluation score: {e} "
                    "Falling back to standard score."
                )

                if not self.debug:
                    error_message += (
                        " Further errors will be suppressed "
                        + "(set debug=True to see all errors)."
                    )
                error_message += f" Offending output: {output.completion}"

                logger.error(error_message)

    if self.normalise:
        score = (raw_score - self.min_score) / (self.max_score - self.min_score)
    else:
        score = raw_score

    return Score(
        value=score,
        answer=prediction,
        metadata={
            "prompt": llm_input,
            "output_text": output.completion,
            "raw_score": raw_score,
        },
    )

GEvalScorerFactory

Bases: ScorerFactory

Scorer factory for G-Eval.

Methods:

Name Description
__init__

Initialize the G-Eval scorer factory.

create_scorer

Creates a G-Eval scorer.

Source code in evalsense/evaluation/evaluators/g_eval.py
class GEvalScorerFactory(ScorerFactory):
    """Scorer factory for G-Eval."""

    def __init__(
        self,
        name: str,
        prompt_template: str,
        metrics: list[Metric | dict[str, list[Metric]]]
        | dict[str, list[Metric]]
        | None = None,
        logprobs: bool = True,
        top_logprobs: int = 20,
        min_score: int = 1,
        max_score: int = 10,
        normalise: bool = True,
        debug: bool = False,
    ):
        """
        Initialize the G-Eval scorer factory.

        Args:
            name (str): The name of the scorer.
            prompt_template (str): The prompt template with the scoring instructions.
            metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
                The metrics to use for the evaluation. If `None`, the default metric
                will be used (G-Eval with mean aggregation).
            logprobs (bool): Whether to use model log probabilities to compute weighted
                evaluation score instead of a standard score.
            top_logprobs (int): The number of top log probabilities to consider.
            min_score (int): The minimum valid score.
            max_score (int): The maximum valid score.
            normalise (bool): Whether to normalise the scores between 0 and 1.
            debug (bool): Whether to report repeated errors in the log.
        """
        self.name = name
        self.prompt_template = prompt_template
        if metrics is None:
            metrics = [mean()]
        self.metrics = metrics
        self.logprobs = logprobs
        self.top_logprobs = top_logprobs
        self.min_score = min_score
        self.max_score = max_score
        self.normalise = normalise
        self.debug = debug

    @override
    def create_scorer(self, model: Model) -> Scorer:
        """
        Creates a G-Eval scorer.

        Args:
            model (Model): The model to create a scorer for.

        Returns:
            Scorer: The created G-Eval scorer.
        """

        @scorer(name=self.name, metrics=self.metrics)
        def g_eval_scorer() -> Scorer:
            g_eval_calculator = GEvalScoreCalculator(
                model=model,
                prompt_template=self.prompt_template,
                logprobs=self.logprobs,
                top_logprobs=self.top_logprobs,
                min_score=self.min_score,
                max_score=self.max_score,
                normalise=self.normalise,
                debug=self.debug,
            )

            async def score(state: TaskState, target: Target):
                return await g_eval_calculator.calculate_async(
                    input=state.input_text,
                    prediction=state.output.completion,
                    reference=target.text,
                    metadata=state.metadata,
                )

            return score

        return g_eval_scorer()
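
A sketch of wiring the factory into Inspect AI, assuming the same import path and template placeholders as in the previous sketch. The returned Scorer is what an EvalSense evaluator or an Inspect AI Task would consume; here it is only created.

from inspect_ai.model import get_model

from evalsense.evaluation.evaluators import GEvalScorerFactory  # import path assumed

factory = GEvalScorerFactory(
    name="g_eval_consistency",
    prompt_template=(
        "Rate the factual consistency of the summary against the source on a scale "
        "from 1 to 10. Respond with the score only.\n\n"
        "Source: {input}\n\nSummary: {prediction}"
    ),
    min_score=1,
    max_score=10,
    normalise=True,
)

# create_scorer binds the judge model; the resulting Scorer evaluates each TaskState
# by calling GEvalScoreCalculator.calculate_async under the hood.
scorer = factory.create_scorer(get_model("openai/gpt-4o"))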

__init__

__init__(
    name: str,
    prompt_template: str,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
)

Initialize the G-Eval scorer factory.

Parameters:

Name Type Description Default
name str

The name of the scorer.

required
prompt_template str

The prompt template with the scoring instructions.

required
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, the default metric will be used (G-Eval with mean aggregation).

None
logprobs bool

Whether to use model log probabilities to compute weighted evaluation score instead of a standard score.

True
top_logprobs int

The number of top log probabilities to consider.

20
min_score int

The minimum valid score.

1
max_score int

The maximum valid score.

10
normalise bool

Whether to normalise the scores between 0 and 1.

True
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/g_eval.py
def __init__(
    self,
    name: str,
    prompt_template: str,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
):
    """
    Initialize the G-Eval scorer factory.

    Args:
        name (str): The name of the scorer.
        prompt_template (str): The prompt template with the scoring instructions.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metric
            will be used (G-Eval with mean aggregation).
        logprobs (bool): Whether to use model log probabilities to compute weighted
            evaluation score instead of a standard score.
        top_logprobs (int): The number of top log probabilities to consider.
        min_score (int): The minimum valid score.
        max_score (int): The maximum valid score.
        normalise (bool): Whether to normalise the scores between 0 and 1.
        debug (bool): Whether to report repeated errors in the log.
    """
    self.name = name
    self.prompt_template = prompt_template
    if metrics is None:
        metrics = [mean()]
    self.metrics = metrics
    self.logprobs = logprobs
    self.top_logprobs = top_logprobs
    self.min_score = min_score
    self.max_score = max_score
    self.normalise = normalise
    self.debug = debug

create_scorer

create_scorer(model: Model) -> Scorer

Creates a G-Eval scorer.

Parameters:

Name Type Description Default
model Model

The model to create a scorer for.

required

Returns:

Name Type Description
Scorer Scorer

The created G-Eval scorer.

Source code in evalsense/evaluation/evaluators/g_eval.py
@override
def create_scorer(self, model: Model) -> Scorer:
    """
    Creates a G-Eval scorer.

    Args:
        model (Model): The model to create a scorer for.

    Returns:
        Scorer: The created G-Eval scorer.
    """

    @scorer(name=self.name, metrics=self.metrics)
    def g_eval_scorer() -> Scorer:
        g_eval_calculator = GEvalScoreCalculator(
            model=model,
            prompt_template=self.prompt_template,
            logprobs=self.logprobs,
            top_logprobs=self.top_logprobs,
            min_score=self.min_score,
            max_score=self.max_score,
            normalise=self.normalise,
            debug=self.debug,
        )

        async def score(state: TaskState, target: Target):
            return await g_eval_calculator.calculate_async(
                input=state.input_text,
                prediction=state.output.completion,
                reference=target.text,
                metadata=state.metadata,
            )

        return score

    return g_eval_scorer()

QagsConfig

Bases: Protocol

A protocol for configuring QAGS evaluation.

Methods:

Name Description
__init__

Initializes the QAGS configuration.

enforce_not_none

Helper method to enforce that a parameter is not None.

get_answer_comparison_prompt

Constructs the prompt for comparing answers to the generated questions.

get_answer_generation_prompt

Constructs the prompt for generating the answer to a single question.

get_question_generation_prompt

Constructs the prompt for generating the questions for the model output.

Source code in evalsense/evaluation/evaluators/qags.py
class QagsConfig(Protocol):
    """A protocol for configuring QAGS evaluation."""

    answer_comparison_mode: Literal["ternary", "exact", "judge"]
    logprobs: bool
    top_logprobs: int
    ci: float
    debug: bool

    def __init__(
        self,
        answer_comparison_mode: Literal["ternary", "exact", "judge"],
        logprobs: bool = True,
        top_logprobs: int = 20,
        ci: float = 0.1,
        debug: bool = False,
    ):
        """
        Initializes the QAGS configuration.

        Args:
            answer_comparison_mode (Literal["ternary", "exact", "judge"]): The mode
                for comparing answers. Either "ternary", "exact", or "judge".
                In "ternary" mode, the model is expected to answer the generated
                questions with "yes", "no", or "unknown". In other modes, the model
                may give arbitrary answers, which are either compared in terms
                of exact match or compared by the model itself.
            logprobs (bool): Whether to use logprobs to compute weighted answers. Can only
                be used when `answer_comparison_mode` is set to "judge".
            top_logprobs (int): The number of top log probabilities to consider
                when computing weighted answers.
            ci (float): The range near the extreme values (0.0 or 1.0) in which
                to consider the model answer as confident when comparing answers.
                This only affects the score explanation when `answer_comparison_mode`
                is set to "judge". The default value is 0.1, which means that
                answers with a score of 0.9 or higher are confident "yes", while answers
                with a score of 0.1 or lower are confident "no".
            debug (bool): Whether to report repeated errors in the log.
        """
        self.answer_comparison_mode = answer_comparison_mode
        self.logprobs = logprobs
        self.top_logprobs = top_logprobs
        self.ci = ci
        self.debug = debug

    def enforce_not_none[T](self, param_name: str, param_value: T | None) -> T:
        """
        Helper method to enforce that a parameter is not None.

        Args:
            param_name (str): The name of the parameter.
            param_value (T | None): The value of the parameter.

        Raises:
            ValueError: If the parameter value is None.

        Returns:
            T: The parameter value if it is not None.
        """
        if param_value is None:
            raise ValueError(f"{param_name} cannot be None.")
        return param_value

    @abstractmethod
    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        """
        Constructs the prompt for generating the questions for the model output.

        The prompt should instruct the model to generate each question on
        a separate line.

        Args:
            source (Literal["prediction", "reference"]): The source to use for
                generating the questions. Either "prediction" or "reference".
                According to the source, the generated prompt should either use
                the model output or the reference output/input. When
                `answer_comparison_mode` is set to "ternary", the generated
                questions should be answerable with "yes", "no", or "unknown".
            prediction (str, optional): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            str: The generated prompt.
        """
        ...

    @abstractmethod
    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        """
        Constructs the prompt for generating the answer to a single question.

        Args:
            source (Literal["prediction", "reference"]): The source to use for
                generating the answer. Either "prediction" or "reference".
                According to the source, the generated prompt should either use
                the model output or the reference output/input when asking
                the model to answer the question. When `answer_comparison_mode`
                is set to "ternary", the prompt should instruct the model to
                answer the question with "yes", "no", or "unknown". Otherwise,
                the model should be instructed to give an answer only without
                any further comments.
            prediction (str, optional): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            str: The generated prompt.
        """
        ...

    def get_answer_comparison_prompt(
        self,
        *,
        question: str,
        prediction_answer: str,
        reference_answer: str,
        input: str | None = None,
        prediction: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        """
        Constructs the prompt for comparing answers to the generated questions.

        This method is only used when `answer_comparison_mode` is set to "judge".

        Args:
            question (str): The question to compare answers for.
            prediction_answer (str): The answer generated from the model output.
            reference_answer (str): The answer generated from the reference output.
            input (str | None, optional): The input to the model. Optional.
            prediction (str | None, optional): The model output to evaluate. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            str: The generated prompt.
        """
        if self.answer_comparison_mode == "judge":
            raise NotImplementedError(
                "Answer comparison prompt generation is not implemented. "
                "If you want to use QAGS in judge mode, please implement this method."
            )
        assert False, (
            "Should not attempt to generate comparison prompt in non-judge mode."
        )
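
Because QagsConfig is a protocol with abstract prompt-construction methods, using it means subclassing it. The sketch below is a hypothetical ternary-mode configuration for summarisation; the import path and the prompt wording are assumptions, not part of the library.

from typing import Any, Literal

from evalsense.evaluation.evaluators import QagsConfig  # import path assumed


class SummaryQagsConfig(QagsConfig):
    """Illustrative ternary-mode QAGS configuration for summarisation."""

    def __init__(self) -> None:
        # In "ternary" mode answers are yes/no/unknown, so no judge comparison prompt is needed.
        super().__init__(answer_comparison_mode="ternary", logprobs=False)

    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        text = (
            prediction
            if source == "prediction"
            else self.enforce_not_none("reference", reference)
        )
        return (
            "Write yes/no questions that check the facts stated in the text below, "
            "one question per line.\n\n" + text
        )

    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        text = (
            self.enforce_not_none("prediction", prediction)
            if source == "prediction"
            else self.enforce_not_none("reference", reference)
        )
        return (
            "Answer the question with yes, no or unknown, based only on the text below.\n\n"
            f"Text: {text}\n\nQuestion: {question}"
        )

An instance of such a configuration would typically be passed to the QAGS scorer factory or evaluator constructor (QagsScorerFactory / get_qags_evaluator) listed at the top of this page.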

__init__

__init__(
    answer_comparison_mode: Literal[
        "ternary", "exact", "judge"
    ],
    logprobs: bool = True,
    top_logprobs: int = 20,
    ci: float = 0.1,
    debug: bool = False,
)

Initializes the QAGS configuration.

Parameters:

Name Type Description Default
answer_comparison_mode Literal['ternary', 'exact', 'judge']

The mode for comparing answers. Either "ternary", "exact", or "judge". In "ternary" mode, the model is expected to answer the generated questions with "yes", "no", or "unknown". In other modes, the model may give arbitrary answers, which are either compared in terms of exact match or compared by the model itself.

required
logprobs bool

Whether to use logprobs to compute weighted answers. Can only be used when answer_comparison_mode is set to "judge".

True
top_logprobs int

The number of top log probabilities to consider when computing weighted answers.

20
ci float

The range near the extreme values (0.0 or 1.0) in which to consider the model answer as confident when comparing answers. This only affects the score explanation when answer_comparison_mode is set to "judge". The default value is 0.1, which means that answers with a score of 0.9 or higher are confident "yes", while answers with a score of 0.1 or lower are confident "no".

0.1
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/qags.py
def __init__(
    self,
    answer_comparison_mode: Literal["ternary", "exact", "judge"],
    logprobs: bool = True,
    top_logprobs: int = 20,
    ci: float = 0.1,
    debug: bool = False,
):
    """
    Initializes the QAGS configuration.

    Args:
        answer_comparison_mode (Literal["ternary", "exact", "judge"]): The mode
            for comparing answers. Either "ternary", "exact", or "judge".
            In "ternary" mode, the model is expected to answer the generated
            questions with "yes", "no", or "unknown". In other modes, the model
            may give arbitrary answers, which are either compared in terms
            of exact match or compared by the model itself.
        logprobs (bool): Whether to use logprobs to compute weighted answers. Can only
            be used when `answer_comparison_mode` is set to "judge".
        top_logprobs (int): The number of top log probabilities to consider
            when computing weighted answers.
        ci (float): The range near the extreme values (0.0 or 1.0) in which
            to consider the model answer as confident when comparing answers.
            This only affects the score explanation when `answer_comparison_mode`
            is set to "judge". The default value is 0.1, which means that
            answers with a score of 0.9 or higher are confident "yes", while answers
            with a score of 0.1 or lower are confident "no".
        debug (bool): Whether to report repeated errors in the log.
    """
    self.answer_comparison_mode = answer_comparison_mode
    self.logprobs = logprobs
    self.top_logprobs = top_logprobs
    self.ci = ci
    self.debug = debug

enforce_not_none

enforce_not_none(
    param_name: str, param_value: T | None
) -> T

Helper method to enforce that a parameter is not None.

Parameters:

Name Type Description Default
param_name str

The name of the parameter.

required
param_value T | None

The value of the parameter.

required

Raises:

Type Description
ValueError

If the parameter value is None.

Returns:

Name Type Description
T T

The parameter value if it is not None.

Source code in evalsense/evaluation/evaluators/qags.py
def enforce_not_none[T](self, param_name: str, param_value: T | None) -> T:
    """
    Helper method to enforce that a parameter is not None.

    Args:
        param_name (str): The name of the parameter.
        param_value (T | None): The value of the parameter.

    Raises:
        ValueError: If the parameter value is None.

    Returns:
        T: The parameter value if it is not None.
    """
    if param_value is None:
        raise ValueError(f"{param_name} cannot be None.")
    return param_value

get_answer_comparison_prompt

get_answer_comparison_prompt(
    *,
    question: str,
    prediction_answer: str,
    reference_answer: str,
    input: str | None = None,
    prediction: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str

Constructs the prompt for comparing answers to the generated questions.

This method is only used when answer_comparison_mode is set to "judge".

Parameters:

Name Type Description Default
question str

The question to compare answers for.

required
prediction_answer str

The answer generated from the model output.

required
reference_answer str

The answer generated from the reference output.

required
input str | None

The input to the model. Optional.

None
prediction str | None

The model output to evaluate. Optional.

None
reference str | None

The reference output to compare against. Optional.

None
metadata dict[str, Any] | None

Additional Inspect AI sample/task state metadata. Optional.

None

Returns:

Name Type Description
str str

The generated prompt.

Source code in evalsense/evaluation/evaluators/qags.py
def get_answer_comparison_prompt(
    self,
    *,
    question: str,
    prediction_answer: str,
    reference_answer: str,
    input: str | None = None,
    prediction: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    """
    Constructs the prompt for comparing answers to the generated questions.

    This method is only used when `answer_comparison_mode` is set to "judge".

    Args:
        question (str): The question to compare answers for.
        prediction_answer (str): The answer generated from the model output.
        reference_answer (str): The answer generated from the reference output.
        input (str | None, optional): The input to the model. Optional.
        prediction (str | None, optional): The model output to evaluate. Optional.
        reference (str | None, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
            state metadata. Optional.

    Returns:
        str: The generated prompt.
    """
    if self.answer_comparison_mode == "judge":
        raise NotImplementedError(
            "Answer comparison prompt generation is not implemented. "
            "If you want to use QAGS in judge mode, please implement this method."
        )
    assert False, (
        "Should not attempt to generate comparison prompt in non-judge mode."
    )

get_answer_generation_prompt abstractmethod

get_answer_generation_prompt(
    *,
    source: Literal["prediction", "reference"],
    question: str,
    prediction: str | None = None,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str

Constructs the prompt for generating the answer to a single question.

Parameters:

Name Type Description Default
source Literal['prediction', 'reference']

The source to use for generating the answer. Either "prediction" or "reference". According to the source, the generated prompt should either use the model output or the reference output/input when asking the model to answer the question. When answer_comparison_mode is set to "ternary", the prompt should instruct the model to answer the question with "yes", "no", or "unknown". Otherwise, the model should be instructed to give an answer only without any further comments.

required
question str

The question to generate the answer for.

required
prediction str

The model output to evaluate.

None
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None

Returns:

Name Type Description
str str

The generated prompt.

Source code in evalsense/evaluation/evaluators/qags.py
@abstractmethod
def get_answer_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    question: str,
    prediction: str | None = None,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    """
    Constructs the prompt for generating the answer to a single question.

    Args:
        source (Literal["prediction", "reference"]): The source to use for
            generating the answer. Either "prediction" or "reference".
            According to the source, the generated prompt should either use
            the model output or the reference output/input when asking
            the model to answer the question. When `answer_comparison_mode`
            is set to "ternary", the prompt should instruct the model to
            answer the question with "yes", "no", or "unknown". Otherwise,
            the model should be instructed to give an answer only without
            any further comments.
        question (str): The question to generate the answer for.
        prediction (str, optional): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.

    Returns:
        str: The generated prompt.
    """
    ...
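
As a concrete illustration, a custom QagsConfig subclass might implement this method roughly as follows (ternary mode); the prompt wording and the choice of context are assumptions, not part of EvalSense.

# Hypothetical implementation in a custom QagsConfig subclass (ternary mode).
# Requires: from typing import Any, Literal
def get_answer_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    question: str,
    prediction: str | None = None,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    # Answer from the model output or from the reference/input, depending on source.
    context = prediction if source == "prediction" else (reference or input)
    return (
        "Answer the question using only the text below.\n"
        f"Text: {context}\n"
        f"Question: {question}\n"
        "Answer with a single word: yes, no, or unknown."
    )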

get_question_generation_prompt abstractmethod

get_question_generation_prompt(
    *,
    source: Literal["prediction", "reference"],
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str

Constructs the prompt for generating the questions for the model output.

The prompt should instruct the model to generate each question on a separate line.

Parameters:

Name Type Description Default
source Literal['prediction', 'reference']

The source to use for generating the questions. Either "prediction" or "reference". According to the source, the generated prompt should either use the model output or the reference output/input. When answer_comparison_mode is set to "ternary", the generated questions should be answerable with "yes", "no", or "unknown".

required
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None

Returns:

Name Type Description
str str

The generated prompt.

Source code in evalsense/evaluation/evaluators/qags.py
@abstractmethod
def get_question_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    """
    Constructs the prompt for generating the questions for the model output.

    The prompt should instruct the model to generate each question on
    a separate line.

    Args:
        source (Literal["prediction", "reference"]): The source to use for
            generating the questions. Either "prediction" or "reference".
            According to the source, the generated prompt should either use
            the model output or the reference output/input. When
            `answer_comparison_mode` is set to "ternary", the generated
            questions should be answerable with "yes", "no", or "unknown".
        prediction (str, optional): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.

    Returns:
        str: The generated prompt.
    """
    ...
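
A matching sketch of this method for a custom QagsConfig subclass in ternary mode (again, the prompt wording is an illustrative assumption). Note that the prompt asks for one question per line ending with a question mark, which is the format QagsScoreCalculator later extracts.

# Hypothetical implementation in a custom QagsConfig subclass (ternary mode).
# Requires: from typing import Any, Literal
def get_question_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    # Generate questions about the model output or the reference/input.
    text = prediction if source == "prediction" else (reference or input)
    return (
        "Read the text below and write questions that check its factual content.\n"
        f"Text: {text}\n"
        "Write one question per line. Each question must end with a question mark "
        "and must be answerable with yes, no, or unknown."
    )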

QagsScoreCalculator

Bases: ScoreCalculator

QAGS score calculator.

Methods:

Name Description
__init__

Initializes the QAGS score calculator.

calculate

This method is not supported for QAGS and will raise an error when called.

calculate_async

Asynchronously computes evaluation scores for QAGS.

Attributes:

Name Type Description
generate_config GenerateConfig

Generation configuration for the model.

Source code in evalsense/evaluation/evaluators/qags.py
class QagsScoreCalculator(ScoreCalculator):
    """QAGS score calculator."""

    _symbol_dict: dict[bool | None, str] = {
        True: "✅",
        False: "❌",
        None: "❓",
    }

    def __init__(
        self,
        model: Model,
        config: QagsConfig,
        name: str = "QAGS",
        debug: bool = False,
    ):
        """
        Initializes the QAGS score calculator.

        Args:
            model (Model): The model to use for evaluation.
            config (QagsConfig): The configuration for the QAGS score calculator.
            name (str): The name of the score calculator. Defaults to "QAGS".
            debug (bool): Whether to report repeated errors in the log.
        """
        self.model = model
        self.config = config
        self.name = name
        self.warned_weighted_answer = False

    @property
    def generate_config(self) -> GenerateConfig:
        """Generation configuration for the model."""
        if self.config.logprobs and self.config.answer_comparison_mode == "judge":
            return GenerateConfig(
                logprobs=self.config.logprobs,
                top_logprobs=self.config.top_logprobs,
            )
        return GenerateConfig()

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """This method is not supported for QAGS and will raise an error when called.

        Use `calculate_async` instead.

        Raises:
            NotImplementedError: When called, as synchronous evaluation is not
                supported for QAGS.
        """
        raise NotImplementedError(
            "Synchronous evaluation is not supported for QAGS. "
            "Use calculate_async instead."
        )

    async def _generate_questions(
        self,
        *,
        prediction: str,
        score_metadata: dict[str, Any],
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> list[str]:
        """Generates questions for the model output and reference output.

        Args:
            prediction (str): The model output to evaluate.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.
            input (str | None, optional): The input to the model. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            list[str]: A list of generated questions.
        """
        # Questions for model outputs
        prediction_questions_prompt = self.config.get_question_generation_prompt(
            source="prediction",
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
        )
        # We don't actually need the logprobs until comparing the answers,
        # but the vLLM provider uses the config from the first sample in the batch
        # so we need to use consistent config for all samples.
        prediction_questions_output = await self.model.generate(
            prediction_questions_prompt, config=self.generate_config
        )
        prediction_questions = extract_lines(
            prediction_questions_output.completion,
            include_filter_fun=lambda line: line.endswith("?"),
        )

        # Questions for reference outputs
        reference_questions_prompt = self.config.get_question_generation_prompt(
            source="reference",
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
        )
        reference_questions_output = await self.model.generate(
            reference_questions_prompt, config=self.generate_config
        )
        reference_questions = extract_lines(
            reference_questions_output.completion,
            include_filter_fun=lambda line: line.endswith("?"),
        )

        questions = prediction_questions + reference_questions

        score_metadata["questions"] = questions
        score_metadata["prediction_questions_prompt"] = prediction_questions_prompt
        score_metadata["reference_questions_prompt"] = reference_questions_prompt
        score_metadata["raw_prediction_questions"] = (
            prediction_questions_output.completion
        )
        score_metadata["raw_reference_questions"] = (
            reference_questions_output.completion
        )

        return questions

    async def _generate_answers(
        self,
        *,
        prediction: str,
        score_metadata: dict[str, Any],
        questions: list[str],
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> tuple[list[str], list[str]]:
        """Generates answers for the model output and reference output.

        Args:
            prediction (str): The model output to evaluate.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.
            questions (list[str]): The list of questions to generate answers for.
            input (str | None, optional): The input to the model. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            tuple[list[str], list[str]]: A tuple containing two lists of generated
                answers - one for the model output and one for the reference output,
                respectively.
        """
        prediction_answers: list[str] = []
        reference_answers: list[str] = []
        score_metadata["raw_prediction_answers"] = []
        score_metadata["raw_reference_answers"] = []
        score_metadata["prediction_answer_prompts"] = []
        score_metadata["reference_answer_prompts"] = []
        for question in questions:
            prediction_answer_prompt = self.config.get_answer_generation_prompt(
                source="prediction",
                question=question,
                prediction=prediction,
                input=input,
                reference=reference,
                metadata=metadata,
            )
            prediction_answer_output = await self.model.generate(
                prediction_answer_prompt, config=self.generate_config
            )
            prediction_answers.append(prediction_answer_output.completion)

            reference_answer_prompt = self.config.get_answer_generation_prompt(
                source="reference",
                question=question,
                prediction=prediction,
                input=input,
                reference=reference,
                metadata=metadata,
            )
            reference_answer_output = await self.model.generate(
                reference_answer_prompt, config=self.generate_config
            )
            reference_answers.append(reference_answer_output.completion)

            score_metadata["raw_prediction_answers"].append(
                prediction_answer_output.completion
            )
            score_metadata["raw_reference_answers"].append(
                reference_answer_output.completion
            )
            score_metadata["prediction_answer_prompts"].append(prediction_answer_prompt)
            score_metadata["reference_answer_prompts"].append(reference_answer_prompt)

        return prediction_answers, reference_answers

    def _evaluate_ternary_answers(
        self,
        *,
        prediction: str,
        questions: list[str],
        raw_prediction_answers: list[str],
        raw_reference_answers: list[str],
        score_metadata: dict[str, Any],
    ) -> Score:
        """Evaluates the answers using the ternary answer comparison mode.

        Args:
            prediction (str): The model output to evaluate.
            questions (list[str]): The list of questions generated for the model
                output.
            raw_prediction_answers (list[str]): The list of answers generated from
                the model output.
            raw_reference_answers (list[str]): The list of answers generated from
                the reference output.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        prediction_answers = [
            extract_ternary_answer(answer, binary_only=False, unknown_on_mismatch=True)
            for answer in raw_prediction_answers
        ]
        reference_answers = [
            extract_ternary_answer(answer, binary_only=False, unknown_on_mismatch=True)
            for answer in raw_reference_answers
        ]

        ref_positive = sum([ra is True for ra in reference_answers])
        pred_positive = sum([pa is True for pa in prediction_answers])
        true_positive = sum(
            [
                pa == ra and ra is True
                for pa, ra in zip(prediction_answers, reference_answers)
            ]
        )
        total_correct = sum(
            [pa == ra for pa, ra in zip(prediction_answers, reference_answers)]
        )

        coverage = true_positive / ref_positive if ref_positive > 0 else 0.0
        groundedness = true_positive / pred_positive if pred_positive > 0 else 0.0
        accuracy = (
            total_correct / len(prediction_answers)
            if len(prediction_answers) > 0
            else 0.0
        )

        explanation = "QAGS Evaluation Report\n\n\nMismatched Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            if pa == ra:
                continue
            explanation += (
                f"* [{i}] Q: {question}, PA: {self._symbol_dict.get(pa)}, "
                f"RA: {self._symbol_dict.get(ra)}, "
                f"Match: {self._symbol_dict.get(False)}\n"
            )
        explanation += "\n\nAll Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            explanation += (
                f"* [{i}] Q: {question}, PA: {self._symbol_dict.get(pa)}, "
                f"RA: {self._symbol_dict.get(ra)}, "
                f"Match: {self._symbol_dict.get(pa == ra)}\n"
            )
        explanation += (
            "\n\n"
            + f"Coverage: {coverage:.2f} ({true_positive}/{ref_positive})\n"
            + f"Groundedness: {groundedness:.2f} ({true_positive}/{pred_positive})\n"
            + f"Accuracy: {accuracy:.2f} ({total_correct}/{len(prediction_answers)})"
        )

        score_metadata["prediction_answers"] = prediction_answers
        score_metadata["reference_answers"] = reference_answers
        score_metadata = {"explanation": explanation} | score_metadata

        return Score(
            value={
                f"{self.name} Coverage": coverage,
                f"{self.name} Groundedness": groundedness,
                f"{self.name} Accuracy": accuracy,
            },
            answer=prediction,
            explanation=explanation,
            metadata=score_metadata,
        )

    def _evaluate_exact_answers(
        self,
        *,
        prediction: str,
        questions: list[str],
        raw_prediction_answers: list[str],
        raw_reference_answers: list[str],
        score_metadata: dict[str, Any],
    ) -> Score:
        """Evaluates the answers using the exact answer comparison mode.

        Args:
            prediction (str): The model output to evaluate.
            questions (list[str]): The list of questions generated for the model
                output.
            raw_prediction_answers (list[str]): The list of answers generated from
                the model output.
            raw_reference_answers (list[str]): The list of answers generated from
                the reference output.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        prediction_answers = [pa.strip().lower() for pa in raw_prediction_answers]
        reference_answers = [ra.strip().lower() for ra in raw_reference_answers]
        total_correct = sum(
            [pa == ra for pa, ra in zip(prediction_answers, reference_answers)]
        )
        accuracy = (
            total_correct / len(prediction_answers)
            if len(prediction_answers) > 0
            else 0.0
        )

        explanation = "QAGS Evaluation Report\n\n\nMismatched Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            if pa == ra:
                continue
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, "
                f"Match: {self._symbol_dict.get(False)}\n"
            )
        explanation += "\n\nAll Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, "
                f"Match: {self._symbol_dict.get(pa == ra)}\n"
            )
        explanation += (
            f"\n\nAccuracy: {accuracy:.2f} ({total_correct}/{len(prediction_answers)})"
        )

        score_metadata["prediction_answers"] = prediction_answers
        score_metadata["reference_answers"] = reference_answers
        score_metadata = {"explanation": explanation} | score_metadata

        return Score(
            value=accuracy,
            answer=prediction,
            explanation=explanation,
            metadata=score_metadata,
        )

    async def _evaluate_judge_answers(
        self,
        prediction: str,
        questions: list[str],
        raw_prediction_answers: list[str],
        raw_reference_answers: list[str],
        score_metadata: dict[str, Any],
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> Score:
        """Evaluates the answers using the judge answer comparison mode.

        Args:
            prediction (str): The model output to evaluate.
            questions (list[str]): The list of questions generated for the model
                output.
            raw_prediction_answers (list[str]): The list of answers generated from
                the model output.
            raw_reference_answers (list[str]): The list of answers generated from
                the reference output.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.
            input (str | None, optional): The input to the model. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        answer_comparisons: list[float] = []
        for question, prediction_answer, reference_answer in zip(
            questions,
            raw_prediction_answers,
            raw_reference_answers,
        ):
            answer_comparison_prompt = self.config.get_answer_comparison_prompt(
                question=question,
                prediction_answer=prediction_answer,
                reference_answer=reference_answer,
                input=input,
                prediction=prediction,
                reference=reference,
                metadata=metadata,
            )
            answer_comparison_output = await self.model.generate(
                answer_comparison_prompt, config=self.generate_config
            )
            answer_comparison = float(
                extract_ternary_answer(
                    answer_comparison_output.completion,
                    binary_only=True,
                    unknown_on_mismatch=False,
                )
            )
            if self.config.logprobs:
                try:
                    answer_comparison = extract_weighted_binary_answer(
                        answer_comparison_output
                    )
                except ValueError as e:
                    if not self.warned_weighted_answer or self.config.debug:
                        self.warned_weighted_answer = True

                        error_message = (
                            f"❌  Cannot compute weighted comparison score: {e} "
                            "Falling back to binary comparison."
                        )

                        if not self.config.debug:
                            error_message += (
                                " Further errors will be suppressed "
                                + "(set debug=True to see all errors)."
                            )

                        logger.error(error_message)
            answer_comparisons.append(answer_comparison)

        def to_match_symbol(answer_comparison: float) -> str:
            if answer_comparison > 1 - self.config.ci:
                return self._symbol_dict[True]
            elif answer_comparison < self.config.ci:
                return self._symbol_dict[False]
            else:
                return self._symbol_dict[None]

        accuracy = sum(answer_comparisons) / len(answer_comparisons)

        explanation = "QAGS Evaluation Report\n\n\nMismatched Q&As\n"
        for i, (question, pa, ra, ac) in enumerate(
            zip(
                questions,
                raw_prediction_answers,
                raw_reference_answers,
                answer_comparisons,
            )
        ):
            if ac > 1 - self.config.ci:
                continue
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, Score: {ac:.2f}, "
                f"Match: {to_match_symbol(ac)}\n"
            )
        explanation += "\n\nAll Q&As\n"
        for i, (question, pa, ra, ac) in enumerate(
            zip(
                questions,
                raw_prediction_answers,
                raw_reference_answers,
                answer_comparisons,
            )
        ):
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, Score: {ac:.2f}, "
                f"Match: {to_match_symbol(ac)}\n"
            )
        explanation += (
            f"\n\nAccuracy: {accuracy:.2f} "
            + f"({sum(answer_comparisons):.2f}/{len(answer_comparisons)})"
        )

        score_metadata["prediction_answers"] = raw_prediction_answers
        score_metadata["reference_answers"] = raw_reference_answers
        score_metadata["answer_comparisons"] = answer_comparisons
        score_metadata = {"explanation": explanation} | score_metadata

        return Score(
            value=accuracy,
            answer=prediction,
            explanation=explanation,
            metadata=score_metadata,
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Asynchronously computes evaluation scores for QAGS.

        Args:
            prediction (str): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.
            **kwargs (dict): Additional keyword arguments specific to the given
                evaluation method.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """

        score_metadata = {}

        all_questions = await self._generate_questions(
            prediction=prediction,
            score_metadata=score_metadata,
            input=input,
            reference=reference,
            metadata=metadata,
        )

        prediction_answers, reference_answers = await self._generate_answers(
            prediction=prediction,
            score_metadata=score_metadata,
            questions=all_questions,
            input=input,
            reference=reference,
            metadata=metadata,
        )

        match self.config.answer_comparison_mode:
            case "ternary":
                return self._evaluate_ternary_answers(
                    prediction=prediction,
                    questions=all_questions,
                    raw_prediction_answers=prediction_answers,
                    raw_reference_answers=reference_answers,
                    score_metadata=score_metadata,
                )
            case "exact":
                return self._evaluate_exact_answers(
                    prediction=prediction,
                    questions=all_questions,
                    raw_prediction_answers=prediction_answers,
                    raw_reference_answers=reference_answers,
                    score_metadata=score_metadata,
                )
            case "judge":
                return await self._evaluate_judge_answers(
                    prediction=prediction,
                    questions=all_questions,
                    raw_prediction_answers=prediction_answers,
                    raw_reference_answers=reference_answers,
                    score_metadata=score_metadata,
                    input=input,
                    reference=reference,
                    metadata=metadata,
                )
            case _:
                raise ValueError(
                    f"Invalid answer comparison mode: {self.config.answer_comparison_mode}. "
                    "Expected one of 'ternary', 'exact', 'judge'."
                )

generate_config property

generate_config: GenerateConfig

Generation configuration for the model.

__init__

__init__(
    model: Model,
    config: QagsConfig,
    name: str = "QAGS",
    debug: bool = False,
)

Initializes the QAGS score calculator.

Parameters:

Name Type Description Default
model Model

The model to use for evaluation.

required
config QagsConfig

The configuration for the QAGS score calculator.

required
name str

The name of the score calculator. Defaults to "QAGS".

'QAGS'
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/qags.py
def __init__(
    self,
    model: Model,
    config: QagsConfig,
    name: str = "QAGS",
    debug: bool = False,
):
    """
    Initializes the QAGS score calculator.

    Args:
        model (Model): The model to use for evaluation.
        config (QagsConfig): The configuration for the QAGS score calculator.
        name (str): The name of the score calculator. Defaults to "QAGS".
        debug (bool): Whether to report repeated errors in the log.
    """
    self.model = model
    self.config = config
    self.name = name
    self.warned_weighted_answer = False

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

This method is not supported for QAGS and will raise an error when called.

Use calculate_async instead.

Raises:

Type Description
NotImplementedError

When called, as synchronous evaluation is not supported for QAGS.

Source code in evalsense/evaluation/evaluators/qags.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """This method is not supported for QAGS and will raise an error when called.

    Use `calculate_async` instead.

    Raises:
        NotImplementedError: When called, as synchronous evaluation is not
            supported for QAGS.
    """
    raise NotImplementedError(
        "Synchronous evaluation is not supported for QAGS. "
        "Use calculate_async instead."
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Asynchronously computes evaluation scores for QAGS.

Parameters:

Name Type Description Default
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None
**kwargs dict

Additional keyword arguments specific to the given evaluation method.

{}

Returns:

Name Type Description
Score Score

The Inspect AI Score object with the calculated result.

Source code in evalsense/evaluation/evaluators/qags.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Asynchronously computes evaluation scores for QAGS.

    Args:
        prediction (str): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.
        **kwargs (dict): Additional keyword arguments specific to the given
            evaluation method.

    Returns:
        Score: The Inspect AI Score object with the calculated result.
    """

    score_metadata = {}

    all_questions = await self._generate_questions(
        prediction=prediction,
        score_metadata=score_metadata,
        input=input,
        reference=reference,
        metadata=metadata,
    )

    prediction_answers, reference_answers = await self._generate_answers(
        prediction=prediction,
        score_metadata=score_metadata,
        questions=all_questions,
        input=input,
        reference=reference,
        metadata=metadata,
    )

    match self.config.answer_comparison_mode:
        case "ternary":
            return self._evaluate_ternary_answers(
                prediction=prediction,
                questions=all_questions,
                raw_prediction_answers=prediction_answers,
                raw_reference_answers=reference_answers,
                score_metadata=score_metadata,
            )
        case "exact":
            return self._evaluate_exact_answers(
                prediction=prediction,
                questions=all_questions,
                raw_prediction_answers=prediction_answers,
                raw_reference_answers=reference_answers,
                score_metadata=score_metadata,
            )
        case "judge":
            return await self._evaluate_judge_answers(
                prediction=prediction,
                questions=all_questions,
                raw_prediction_answers=prediction_answers,
                raw_reference_answers=reference_answers,
                score_metadata=score_metadata,
                input=input,
                reference=reference,
                metadata=metadata,
            )
        case _:
            raise ValueError(
                f"Invalid answer comparison mode: {self.config.answer_comparison_mode}. "
                "Expected one of 'ternary', 'exact', 'judge'."
            )
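
For orientation, the ternary comparison mode reduces to simple counting over the extracted yes/no/unknown answers, as implemented in _evaluate_ternary_answers above. A small worked example with hypothetical answer lists:

# Hypothetical worked example of the ternary QAGS metrics (not library code).
prediction_answers = [True, True, False, None]   # answers derived from the model output
reference_answers = [True, False, False, True]   # answers derived from the reference

ref_positive = sum(ra is True for ra in reference_answers)    # 2
pred_positive = sum(pa is True for pa in prediction_answers)  # 2
true_positive = sum(
    pa == ra and ra is True
    for pa, ra in zip(prediction_answers, reference_answers)
)  # 1
total_correct = sum(
    pa == ra for pa, ra in zip(prediction_answers, reference_answers)
)  # 2

coverage = true_positive / ref_positive             # 1/2 = 0.50
groundedness = true_positive / pred_positive        # 1/2 = 0.50
accuracy = total_correct / len(prediction_answers)  # 2/4 = 0.50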

QagsScorerFactory

Bases: ScorerFactory

Scorer factory for QAGS.

Methods:

Name Description
__init__

Initialize the QAGS scorer factory.

create_scorer

Creates a QAGS scorer.

Source code in evalsense/evaluation/evaluators/qags.py
class QagsScorerFactory(ScorerFactory):
    """Scorer factory for QAGS."""

    def __init__(
        self,
        name: str,
        config: QagsConfig,
        metrics: list[Metric | dict[str, list[Metric]]]
        | dict[str, list[Metric]]
        | None = None,
    ):
        """
        Initialize the QAGS scorer factory.

        Args:
            name (str): The name of the QAGS scorer.
            config (QagsConfig): The configuration for the QAGS scorer.
            metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
                The metrics to use for the evaluation. If `None`, mean aggregation
                is used by default (applied to the QAGS Coverage, Groundedness and
                Accuracy scores in the "ternary" answer comparison mode, or to the
                overall QAGS score otherwise).
        """
        self.name = name
        self.config = config
        if metrics is None:
            if self.config.answer_comparison_mode == "ternary":
                metrics = [
                    {
                        f"{name} Coverage": [mean()],
                        f"{name} Groundedness": [mean()],
                        f"{name} Accuracy": [mean()],
                    }
                ]
            else:
                metrics = [mean()]
        self.metrics = metrics

    @override
    def create_scorer(self, model: Model) -> Scorer:
        """
        Creates a QAGS scorer.

        Args:
            model (Model): The model to create a scorer for.

        Returns:
            Scorer: The created QAGS scorer.
        """

        @scorer(name=self.name, metrics=self.metrics)
        def qags_scorer() -> Scorer:
            qags_score_calculator = QagsScoreCalculator(
                model=model,
                config=self.config,
                name=self.name,
            )

            async def score(state: TaskState, target: Target):
                return await qags_score_calculator.calculate_async(
                    input=state.input_text,
                    prediction=state.output.completion,
                    reference=target.text,
                    metadata=state.metadata,
                )

            return score

        return qags_scorer()

__init__

__init__(
    name: str,
    config: QagsConfig,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
)

Initialize the QAGS scorer factory.

Parameters:

Name Type Description Default
name str

The name of the QAGS scorer.

required
config QagsConfig

The configuration for the QAGS scorer.

required
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, mean aggregation is used by default (applied to the QAGS Coverage, Groundedness and Accuracy scores in the "ternary" answer comparison mode, or to the overall QAGS score otherwise).

None
Source code in evalsense/evaluation/evaluators/qags.py
def __init__(
    self,
    name: str,
    config: QagsConfig,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
):
    """
    Initialize the QAGS scorer factory.

    Args:
        name (str): The name of the QAGS scorer.
        config (QagsConfig): The configuration for the QAGS scorer.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, mean aggregation
            is used by default (applied to the QAGS Coverage, Groundedness and
            Accuracy scores in the "ternary" answer comparison mode, or to the
            overall QAGS score otherwise).
    """
    self.name = name
    self.config = config
    if metrics is None:
        if self.config.answer_comparison_mode == "ternary":
            metrics = [
                {
                    f"{name} Coverage": [mean()],
                    f"{name} Groundedness": [mean()],
                    f"{name} Accuracy": [mean()],
                }
            ]
        else:
            metrics = [mean()]
    self.metrics = metrics

create_scorer

create_scorer(model: Model) -> Scorer

Creates a QAGS scorer.

Parameters:

Name Type Description Default
model Model

The model to create a scorer for.

required

Returns:

Name Type Description
Scorer Scorer

The created QAGS scorer.

Source code in evalsense/evaluation/evaluators/qags.py
@override
def create_scorer(self, model: Model) -> Scorer:
    """
    Creates a QAGS scorer.

    Args:
        model (Model): The model to create a scorer for.

    Returns:
        Scorer: The created QAGS scorer.
    """

    @scorer(name=self.name, metrics=self.metrics)
    def qags_scorer() -> Scorer:
        qags_score_calculator = QagsScoreCalculator(
            model=model,
            config=self.config,
            name=self.name,
        )

        async def score(state: TaskState, target: Target):
            return await qags_score_calculator.calculate_async(
                input=state.input_text,
                prediction=state.output.completion,
                reference=target.text,
                metadata=state.metadata,
            )

        return score

    return qags_scorer()
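
A minimal usage sketch, assuming MyQagsConfig is a user-defined QagsConfig subclass (such as the prompt implementations sketched earlier on this page) and judge_model is an Inspect AI Model:

# Usage sketch; MyQagsConfig and judge_model are assumptions, not part of EvalSense.
config = MyQagsConfig()  # assumed to provide answer_comparison_mode, prompts, etc.
factory = QagsScorerFactory(name="QAGS", config=config)
qags_scorer = factory.create_scorer(judge_model)  # Inspect AI Scorer bound to the judge model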

RougeScoreCalculator

Bases: ScoreCalculator

Calculator for computing ROUGE scores.

Methods:

Name Description
calculate

Calculates ROUGE scores for the supplied model prediction and reference input.

calculate_async

Calculates ROUGE scores for the supplied model prediction and reference input.

Source code in evalsense/evaluation/evaluators/rouge.py
class RougeScoreCalculator(ScoreCalculator):
    """Calculator for computing ROUGE scores."""

    def __init__(self):
        self.rouge_module = evaluate.load("rouge")

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates ROUGE scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for ROUGE.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for ROUGE.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        if reference is None:
            raise ValueError("Reference is required for computing ROUGE, but was None.")

        predictions = [prediction]
        references = [reference]

        result = self.rouge_module.compute(
            predictions=predictions, references=references
        )
        return Score(
            value={
                "ROUGE-1": result["rouge1"],  # type: ignore
                "ROUGE-2": result["rouge2"],  # type: ignore
                "ROUGE-L": result["rougeL"],  # type: ignore
            },
            answer=prediction,
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates ROUGE scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for ROUGE.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for ROUGE.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        return self.calculate(
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
            **kwargs,
        )

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates ROUGE scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for ROUGE.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for ROUGE.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/rouge.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates ROUGE scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for ROUGE.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for ROUGE.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    if reference is None:
        raise ValueError("Reference is required for computing ROUGE, but was None.")

    predictions = [prediction]
    references = [reference]

    result = self.rouge_module.compute(
        predictions=predictions, references=references
    )
    return Score(
        value={
            "ROUGE-1": result["rouge1"],  # type: ignore
            "ROUGE-2": result["rouge2"],  # type: ignore
            "ROUGE-L": result["rougeL"],  # type: ignore
        },
        answer=prediction,
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates ROUGE scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for ROUGE.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for ROUGE.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/rouge.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates ROUGE scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for ROUGE.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for ROUGE.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    return self.calculate(
        prediction=prediction,
        input=input,
        reference=reference,
        metadata=metadata,
        **kwargs,
    )
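
A minimal usage sketch outside of an evaluation pipeline; the example texts are illustrative and the import path is assumed from the source location shown above:

# Usage sketch; the texts are illustrative.
# Assumed import: from evalsense.evaluation.evaluators.rouge import RougeScoreCalculator
calculator = RougeScoreCalculator()
score = calculator.calculate(
    prediction="The patient was discharged after three days.",
    reference="The patient was discharged after a three-day stay.",
)
print(score.value)  # {"ROUGE-1": ..., "ROUGE-2": ..., "ROUGE-L": ...}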

bleu_metric

bleu_metric() -> MetricProtocol

Base metric for BLEU scores.

Returns:

Name Type Description
MetricProtocol MetricProtocol

A function that computes BLEU scores.

Source code in evalsense/evaluation/evaluators/bleu.py
def bleu_metric() -> MetricProtocol:
    """
    Base metric for BLEU scores.

    Returns:
        MetricProtocol: A function that computes BLEU scores.
    """

    def metric(scores: list[SampleScore]) -> Value:
        bleu_module = evaluate.load("bleu")
        predictions = [score.score.metadata["prediction"] for score in scores]  # type: ignore
        references = [score.score.metadata["reference"] for score in scores]  # type: ignore
        result = bleu_module.compute(predictions=predictions, references=references)
        result = cast(dict[str, Any], result)
        return result["bleu"]

    return metric

get_bertscore_evaluator

get_bertscore_evaluator(
    *,
    name: str = "BERTScore",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    verbose: bool = False,
    idf: bool | dict[str, float] = False,
    device: str | None = None,
    batch_size: int = 64,
    nthreads: int = 1,
    rescale_with_baseline: bool = False,
    baseline_path: str | None = None,
    use_fast_tokenizer: bool = False,
) -> Evaluator

Returns a BERTScore evaluator.

Parameters:

Name Type Description Default
name str

The name of the evaluator and scorer. Defaults to "BERTScore".

'BERTScore'
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, the default metrics will be used (BERTScore Precision, Recall, and F1 with mean aggregation).

None
model_type str

The model type to use for computing BERTScore. Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing model according to BERTScore authors.

'microsoft/deberta-xlarge-mnli'
lang str

The language of the text. Defaults to "en".

'en'
num_layers int | None

The layer of representations to use. The default is the number of layers tuned on WMT16 correlation data, which depends on the model_type used.

None
verbose bool

Whether to turn on verbose mode. Defaults to False.

False
idf bool | dict

Use IDF weighting — can be a precomputed IDF dictionary. Defaults to False (no IDF weighting).

False
device str | None

The device to use for computing the contextual embeddings. If this argument is not set or None, the model will be loaded on cuda:0 if available.

None
nthreads int

The number of threads to use for computing the contextual embeddings. Defaults to 1.

1
batch_size int

The batch size to use for computing the contextual embeddings. Defaults to 64.

64
rescale_with_baseline bool

Whether to rescale the BERTScore with pre-computed baseline. The default value is False.

False
baseline_path str | None

Customized baseline file.

None
use_fast_tokenizer bool

The use_fast parameter passed to HF tokenizer. Defaults to False.

False

Returns:

Name Type Description
Evaluator Evaluator

The BERTScore evaluator.

Source code in evalsense/evaluation/evaluators/bertscore.py
def get_bertscore_evaluator(
    *,
    name: str = "BERTScore",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    verbose: bool = False,
    idf: bool | dict[str, float] = False,
    device: str | None = None,
    batch_size: int = 64,
    nthreads: int = 1,
    rescale_with_baseline: bool = False,
    baseline_path: str | None = None,
    use_fast_tokenizer: bool = False,
) -> Evaluator:
    """
    Returns a BERTScore evaluator.

    Args:
        name (str): The name of the evaluator and scorer. Defaults to "BERTScore".
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metrics
            will be used (BERTScore Precision, Recall, and F1 with mean aggregation).
        model_type (str, optional): The model type to use for computing BERTScore.
            Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing
            model according to BERTScore authors.
        lang (str, optional): The language of the text. Defaults to "en".
        num_layers (int | None, optional): The layer of representations to use. The
            default is the number of layers tuned on WMT16 correlation data, which
            depends on the `model_type` used.
        verbose (bool, optional): Whether to turn on verbose mode. Defaults to `False`.
        idf (bool | dict, optional): Use IDF weighting — can be a precomputed IDF dictionary.
            Defaults to `False` (no IDF weighting).
        device (str | None, optional): The device to use for computing the contextual
            embeddings. If this argument is not set or `None`, the model will be
            loaded on `cuda:0` if available.
        nthreads (int, optional): The number of threads to use for computing the
            contextual embeddings. Defaults to `1`.
        batch_size (int, optional): The batch size to use for computing the
            contextual embeddings. Defaults to `64`.
        rescale_with_baseline (bool, optional): Whether to rescale the BERTScore with
            pre-computed baseline. The default value is `False`.
        baseline_path (str | None, optional): Customized baseline file.
        use_fast_tokenizer (bool, optional): The `use_fast` parameter passed to HF
            tokenizer. Defaults to `False`.

    Returns:
        Evaluator: The BERTScore evaluator.
    """
    if metrics is None:
        metrics = [
            {"BERTScore Precision": [mean()]},
            {"BERTScore Recall": [mean()]},
            {"BERTScore F1": [mean()]},
        ]

    calculator = BertScoreCalculator(
        model_type=model_type,
        lang=lang,
        num_layers=num_layers,
        idf=idf,
    )

    async def init_bertscore() -> None:
        async with concurrency("init_bertscore", 1):
            if not hasattr(calculator, "bertscore_module"):
                setattr(
                    calculator,
                    "bertscore_module",
                    evaluate.load("bertscore"),
                )

    def cleanup_bertscore() -> None:
        import torch

        del calculator.bertscore_module
        gc.collect()
        torch.cuda.empty_cache()

    @scorer(
        name=name,
        metrics=metrics,
    )
    def bertscore_scorer() -> Scorer:
        async def score(state: TaskState, target: Target) -> Score:
            await init_bertscore()

            return await calculator.calculate_async(
                prediction=state.output.completion,
                reference=target.text,
                verbose=verbose,
                device=device,
                batch_size=batch_size,
                nthreads=nthreads,
                rescale_with_baseline=rescale_with_baseline,
                baseline_path=baseline_path,
                use_fast_tokenizer=use_fast_tokenizer,
            )

        return score

    return Evaluator(
        name=name,
        scorer=bertscore_scorer(),
        cleanup_fun=cleanup_bertscore,
    )
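
A usage sketch constructing the evaluator with a few non-default options; the argument values are illustrative:

# Usage sketch; argument values are illustrative.
bertscore_evaluator = get_bertscore_evaluator(
    model_type="microsoft/deberta-xlarge-mnli",
    lang="en",
    batch_size=32,
    rescale_with_baseline=False,
)
# The returned Evaluator can then be used in an EvalSense evaluation pipeline.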

get_bleu_evaluator

get_bleu_evaluator(
    name: str = "BLEU",
    scorer_name: str = "BLEU Precision",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator

Returns an evaluator for BLEU scores.

Parameters:

Name Type Description Default
name str

The name of the metric and evaluator. Defaults to "BLEU".

'BLEU'
scorer_name str

The name of the internal scorer. Defaults to "BLEU Precision".

'BLEU Precision'
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, the default metric will be used (BLEU).

None

Returns:

Name Type Description
Evaluator Evaluator

An evaluator for BLEU scores.

Source code in evalsense/evaluation/evaluators/bleu.py
def get_bleu_evaluator(
    name: str = "BLEU",
    scorer_name: str = "BLEU Precision",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator:
    """
    Returns an evaluator for BLEU scores.

    Args:
        name (str): The name of the metric and evaluator. Defaults to "BLEU".
        scorer_name (str): The name of the internal scorer. Defaults to "BLEU Precision".
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metric
            will be used (BLEU).

    Returns:
        Evaluator: An evaluator for BLEU scores.
    """

    @metric(name=name)
    def bleu() -> MetricProtocol:
        return bleu_metric()

    if metrics is None:
        metrics = [bleu()]

    bleu_calculator = BleuPrecisionScoreCalculator()

    @scorer(name=scorer_name, metrics=metrics)
    def bleu_precision_scorer() -> Scorer:
        async def score(state: TaskState, target: Target):
            return await bleu_calculator.calculate_async(
                prediction=state.output.completion, reference=target.text
            )

        return score

    return Evaluator(name, scorer=bleu_precision_scorer())
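
A usage sketch with default settings and with a custom metric name:

# Usage sketch.
bleu_evaluator = get_bleu_evaluator()
custom_bleu_evaluator = get_bleu_evaluator(name="Corpus BLEU", scorer_name="BLEU Precision")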

get_g_eval_evaluator

get_g_eval_evaluator(
    *,
    name: str = "G-Eval",
    quality_name: str = "Unknown",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    prompt_template: str,
    model_config: ModelConfig,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
) -> Evaluator

Constructs a G-Eval evaluator that can be used in the EvalSense evaluation pipeline.

Parameters:

    name (str): The name of the evaluator. Defaults to "G-Eval".
    quality_name (str): The name of the quality to be evaluated by G-Eval.
        Defaults to "Unknown".
    model_name (str | None): The name of the model to be used for evaluation.
        If None, the model name will be taken from the model config. Defaults to None.
    metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
        The metrics to use for the evaluation. If None, the default metric will be
        used (G-Eval). Defaults to None.
    prompt_template (str): The prompt template to use. The supplied template should
        be a format string with {prediction} and (optionally) {reference} as
        placeholders, as well as any additional placeholders for entries in
        Inspect AI sample/task state metadata. The template should instruct the
        judge model to respond with a numerical score between the specified
        min_score and max_score. Required.
    model_config (ModelConfig): The model configuration. Required.
    logprobs (bool): Whether to use model log probabilities to compute a weighted
        evaluation score instead of a standard score. Defaults to True.
    top_logprobs (int): The number of top log probabilities to consider. Defaults to 20.
    min_score (int): The minimum valid score. Defaults to 1.
    max_score (int): The maximum valid score. Defaults to 10.
    normalise (bool): Whether to normalise the scores between 0 and 1. Defaults to True.
    debug (bool): Whether to report repeated errors in the log. Defaults to False.

Returns:

    Evaluator: The constructed G-Eval evaluator.

Source code in evalsense/evaluation/evaluators/g_eval.py
def get_g_eval_evaluator(
    *,
    name: str = "G-Eval",
    quality_name: str = "Unknown",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    prompt_template: str,
    model_config: ModelConfig,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
) -> Evaluator:
    """
    Constructs a G-Eval evaluator that can be used in the EvalSense evaluation pipeline.

    Args:
        name (str): The name of the evaluator. Defaults to "G-Eval".
        quality_name (str): The name of the quality to be evaluated by G-Eval.
        model_name (str | None): The name of the model to be used for evaluation.
            If `None`, the model name will be taken from the model config.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metric
            will be used (G-Eval).
        prompt_template (str): The prompt template to use. The supplied template should
            be a format string with {prediction} and (optionally) {reference} as
            placeholders, as well as any additional placeholders for entries in
            Inspect AI sample/task state metadata. The template should instruct the
            judge model to respond with a numerical score between the specified
            min_score and max_score.
        model_config (ModelConfig): The model configuration.
        logprobs (bool): Whether to use model log probabilities to compute weighted
            evaluation score instead of a standard score.
        top_logprobs (int): The number of top log probabilities to consider.
        min_score (int): The minimum valid score.
        max_score (int): The maximum valid score.
        normalise (bool): Whether to normalise the scores between 0 and 1.
        debug (bool): Whether to report repeated errors in the log.

    Returns:
        Evaluator: The constructed G-Eval evaluator.
    """
    metric_name = f"{name} ({quality_name}, {model_name or model_config.name})"
    return Evaluator(
        name=metric_name,
        scorer=GEvalScorerFactory(
            name=metric_name,
            metrics=metrics,
            prompt_template=prompt_template,
            logprobs=logprobs,
            top_logprobs=top_logprobs,
            min_score=min_score,
            max_score=max_score,
            normalise=normalise,
            debug=debug,
        ),
        model_config=model_config,
    )
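
Example usage (a minimal sketch: the template wording is illustrative only, the import path is an assumption, and judge_model_config stands for a ModelConfig constructed elsewhere):

from evalsense.evaluation.evaluators import get_g_eval_evaluator  # import path assumed

# Illustrative template: any wording works as long as it uses {prediction}
# (and optionally {reference}) and asks the judge for a single integer
# between min_score and max_score.
COHERENCE_TEMPLATE = (
    "You will be shown a model-generated summary and a reference summary.\n"
    "Generated summary: {prediction}\n"
    "Reference summary: {reference}\n"
    "Rate the coherence of the generated summary on a scale from 1 to 10.\n"
    "Respond with a single integer and nothing else."
)

g_eval_coherence = get_g_eval_evaluator(
    quality_name="Coherence",
    prompt_template=COHERENCE_TEMPLATE,
    model_config=judge_model_config,  # hypothetical ModelConfig defined elsewhere
    min_score=1,
    max_score=10,
)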

get_qags_evaluator

get_qags_evaluator(
    *,
    config: QagsConfig,
    name: str = "QAGS",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_config: ModelConfig,
) -> Evaluator

Constructs a QAGS evaluator that can be used in the EvalSense evaluation pipeline.

Parameters:

    config (QagsConfig): The configuration for the QAGS evaluator. Required.
    name (str): The name of the QAGS evaluator. Defaults to "QAGS".
    model_name (str | None): The name of the model to use for evaluation.
        If None, the name from the model configuration will be used. Defaults to None.
    metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
        The metrics to use for the evaluation. If None, the default metrics will be
        used (QAGS precision, recall and F1). Defaults to None.
    model_config (ModelConfig): The configuration of the model to be used
        for evaluation. Required.

Returns:

    Evaluator: The constructed QAGS evaluator.

Source code in evalsense/evaluation/evaluators/qags.py
def get_qags_evaluator(
    *,
    config: QagsConfig,
    name: str = "QAGS",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_config: ModelConfig,
) -> Evaluator:
    """
    Constructs a QAGS evaluator that can be used in the EvalSense evaluation pipeline.

    Args:
        config (QagsConfig): The configuration for the QAGS evaluator.
        name (str): The name of the QAGS evaluator.
        model_name (str | None): The name of the model to use for evaluation.
            If `None`, the name from the model configuration will be used.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metrics
            will be used (QAGS precision, recall and F1).
        model_config (ModelConfig): The configuration of the model to be used
            for evaluation.

    Returns:
        Evaluator: The constructed QAGS evaluator.
    """
    metric_name = f"{name} ({model_name or model_config.name})"
    return Evaluator(
        name=metric_name,
        scorer=QagsScorerFactory(
            name=metric_name,
            config=config,
            metrics=metrics,
        ),
        model_config=model_config,
    )
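
Example usage (a minimal sketch: my_qags_config stands for an object implementing the QagsConfig protocol and qa_model_config for a ModelConfig, both assumed to be defined elsewhere; the import path is also an assumption):

from evalsense.evaluation.evaluators import get_qags_evaluator  # import path assumed

qags_evaluator = get_qags_evaluator(
    config=my_qags_config,         # hypothetical QagsConfig implementation
    model_config=qa_model_config,  # hypothetical ModelConfig for the evaluation model
)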

get_rouge_evaluator

get_rouge_evaluator(
    name: str = "ROUGE",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator

Returns an evaluator for ROUGE scores.

Parameters:

    name (str): The name of the evaluator. Defaults to "ROUGE".
    metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
        The metrics to use for evaluation. If None, defaults to ROUGE-1, ROUGE-2,
        and ROUGE-L with mean aggregation.

Returns:

    Evaluator: An evaluator for ROUGE scores.

Source code in evalsense/evaluation/evaluators/rouge.py
def get_rouge_evaluator(
    name: str = "ROUGE",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator:
    """
    Returns an evaluator for ROUGE scores.

    Args:
        name (str): The name of the evaluator. Defaults to "ROUGE".
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for evaluation. If None, defaults to ROUGE-1, ROUGE-2,
            and ROUGE-L with mean aggregation.

    Returns:
        Evaluator: An evaluator for ROUGE scores.
    """
    if metrics is None:
        metrics = [
            {
                "ROUGE-1": [mean()],
                "ROUGE-2": [mean()],
                "ROUGE-L": [mean()],
            }
        ]

    rouge_calculator = RougeScoreCalculator()

    @scorer(name=name, metrics=metrics)
    def rouge_scorer() -> Scorer:
        async def score(state: TaskState, target: Target) -> Score:
            return await rouge_calculator.calculate_async(
                prediction=state.output.completion, reference=target.text
            )

        return score

    return Evaluator(name, scorer=rouge_scorer())
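
Example usage (a minimal sketch; the import paths for get_rouge_evaluator and mean are assumptions based on the code above):

from evalsense.evaluation.evaluators import get_rouge_evaluator  # import path assumed
from inspect_ai.scorer import mean  # assumed source of mean(), as used above

# Default evaluator reporting ROUGE-1, ROUGE-2 and ROUGE-L with mean aggregation.
rouge_evaluator = get_rouge_evaluator()

# Restricting aggregation to ROUGE-L only.
rouge_l_evaluator = get_rouge_evaluator(
    name="ROUGE-L",
    metrics=[{"ROUGE-L": [mean()]}],
)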