Evaluators

Modules:

Name Description
bertscore
bleu
g_eval
qags
rouge

Classes:

Name Description
BertScoreCalculator

Calculator for computing BERTScores.

BleuPrecisionScoreCalculator

Calculator for computing BLEU scores.

GEvalScoreCalculator

G-Eval score calculator.

GEvalScorerFactory

Scorer factory for G-Eval.

QagsConfig

A protocol for configuring QAGS evaluation.

QagsScoreCalculator

QAGS score calculator.

QagsScorerFactory

Scorer factory for QAGS.

RougeScoreCalculator

Calculator for computing ROUGE scores.

Functions:

Name Description
bleu_metric

Base metric for BLEU scores.

get_bertscore_evaluator

Returns a BERTScore evaluator.

get_bleu_evaluator

Returns an evaluator for BLEU scores.

get_g_eval_evaluator

Constructs a G-Eval evaluator that can be used in the EvalSense evaluation pipeline.

get_qags_evaluator

Constructs a QAGS evaluator that can be used in the EvalSense evaluation pipeline.

get_rouge_evaluator

Returns an evaluator for ROUGE scores.

BertScoreCalculator

Bases: ScoreCalculator

Calculator for computing BERTScores.

Methods:

Name Description
__init__

Initializes the BERTScore calculator.

calculate

Calculates BERTScore for the supplied model prediction and reference input.

calculate_async

Calculates BERTScore for the supplied model prediction and reference input.

Source code in evalsense/evaluation/evaluators/bertscore.py
class BertScoreCalculator(ScoreCalculator):
    """Calculator for computing BERTScores."""

    def __init__(
        self,
        model_type: str = "microsoft/deberta-xlarge-mnli",
        lang: str = "en",
        num_layers: int | None = None,
        idf: bool | dict[str, float] = False,
    ):
        """
        Initializes the BERTScore calculator.

        Args:
            model_type (str, optional): The model type to use for computing BERTScore.
                Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing
                model according to the BERTScore authors.
            lang (str, optional): The language of the text. Defaults to "en".
            num_layers (int, optional): The layer of representations to use.
            idf (bool | dict[str, float], optional): Use IDF weighting — can be a precomputed IDF dictionary.
        """
        self.bertscore_module = evaluate.load("bertscore")
        self.model_type = model_type
        self.lang = lang
        self.num_layers = num_layers
        self.idf = idf

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        verbose=False,
        device=None,
        batch_size=64,
        nthreads=1,
        rescale_with_baseline=False,
        baseline_path=None,
        use_fast_tokenizer=False,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BERTScore for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the model input. Ignored for BERTScore.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Metadata for the evaluation. Ignored for BERTScore.
            verbose (bool): Whether to turn on verbose mode.
            device (str, optional): The device to use for computing the contextual embeddings.
            batch_size (int): The batch size to use for computing the contextual embeddings.
            nthreads (int): The number of threads to use for computing the contextual embeddings.
            rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
            baseline_path (str, optional): Customized baseline file.
            use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        if reference is None:
            raise ValueError(
                "Reference is required for computing BERTScore, but was None."
            )

        predictions = [prediction]
        references = [reference]

        result = self.bertscore_module.compute(
            predictions=predictions,
            references=references,
            lang=self.lang,
            model_type=self.model_type,
            num_layers=self.num_layers,
            verbose=verbose,
            idf=self.idf,
            device=device,
            batch_size=batch_size,
            nthreads=nthreads,
            rescale_with_baseline=rescale_with_baseline,
            baseline_path=baseline_path,
            use_fast_tokenizer=use_fast_tokenizer,
        )
        return Score(
            value={
                "BERTScore Precision": result["precision"][0],  # type: ignore
                "BERTScore Recall": result["recall"][0],  # type: ignore
                "BERTScore F1": result["f1"][0],  # type: ignore
            },
            answer=prediction,
            metadata={
                "hashcode": result["hashcode"],  # type: ignore
            },
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        verbose=False,
        device=None,
        batch_size=64,
        nthreads=1,
        rescale_with_baseline=False,
        baseline_path=None,
        use_fast_tokenizer=False,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BERTScore for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str | None): The text of the model input. Ignored for BERTScore.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any] | None): Metadata for the evaluation. Ignored for BERTScore.
            verbose (bool): Whether to turn on verbose mode.
            device (str | None): The device to use for computing the contextual embeddings.
            batch_size (int): The batch size to use for computing the contextual embeddings.
            nthreads (int): The number of threads to use for computing the contextual embeddings.
            rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
            baseline_path (str | None): Customized baseline file.
            use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        return self.calculate(
            prediction=prediction,
            reference=reference,
            verbose=verbose,
            device=device,
            batch_size=batch_size,
            nthreads=nthreads,
            rescale_with_baseline=rescale_with_baseline,
            baseline_path=baseline_path,
            use_fast_tokenizer=use_fast_tokenizer,
        )
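
A minimal usage sketch for the calculator above. The import path and the example texts are assumptions, and the Hugging Face evaluate package (plus the underlying bert_score dependency) must be installed; the selected model is downloaded on first use.

from evalsense.evaluation.evaluators import BertScoreCalculator  # import path assumed

# Instantiating the calculator loads the Hugging Face "bertscore" metric.
calculator = BertScoreCalculator(model_type="microsoft/deberta-xlarge-mnli", lang="en")

score = calculator.calculate(
    prediction="The patient was discharged after three days.",
    reference="The patient left the hospital after a three-day stay.",
    batch_size=16,
)

# score.value is a dict with "BERTScore Precision", "BERTScore Recall" and "BERTScore F1".
print(score.value)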

__init__

__init__(
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    idf: bool | dict[str, float] = False,
)

Initializes the BERTScore calculator.

Parameters:

Name Type Description Default
model_type str

The model type to use for computing BERTScore. Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing model according to the BERTScore authors.

'microsoft/deberta-xlarge-mnli'
lang str

The language of the text. Defaults to "en".

'en'
num_layers int

The layer of representations to use.

None
idf bool | dict[str, float]

Use IDF weighting — can be a precomputed IDF dictionary.

False
Source code in evalsense/evaluation/evaluators/bertscore.py
def __init__(
    self,
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    idf: bool | dict[str, float] = False,
):
    """
    Initializes the BERTScore calculator.

    Args:
        model_type (str, optional): The model type to use for computing BERTScore.
            Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing
            model according to the BERTScore authors.
        lang (str, optional): The language of the text. Defaults to "en".
        num_layers (int, optional): The layer of representations to use.
        idf (bool | dict[str, float], optional): Use IDF weighting — can be a precomputed IDF dictionary.
    """
    self.bertscore_module = evaluate.load("bertscore")
    self.model_type = model_type
    self.lang = lang
    self.num_layers = num_layers
    self.idf = idf

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score

Calculates BERTScore for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the model input. Ignored for BERTScore.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Metadata for the evaluation. Ignored for BERTScore.

None
verbose bool

Whether to turn on verbose mode.

False
device str

The device to use for computing the contextual embeddings.

None
batch_size int

The batch size to use for computing the contextual embeddings.

64
nthreads int

The number of threads to use for computing the contextual embeddings.

1
rescale_with_baseline bool

Whether to rescale the BERTScore with pre-computed baseline.

False
baseline_path str

Customized baseline file.

None
use_fast_tokenizer bool

The use_fast parameter passed to HF tokenizer.

False

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bertscore.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score:
    """
    Calculates BERTScore for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the model input. Ignored for BERTScore.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Metadata for the evaluation. Ignored for BERTScore.
        verbose (bool): Whether to turn on verbose mode.
        device (str, optional): The device to use for computing the contextual embeddings.
        batch_size (int): The batch size to use for computing the contextual embeddings.
        nthreads (int): The number of threads to use for computing the contextual embeddings.
        rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
        baseline_path (str, optional): Customized baseline file.
        use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    if reference is None:
        raise ValueError(
            "Reference is required for computing BERTScore, but was None."
        )

    predictions = [prediction]
    references = [reference]

    result = self.bertscore_module.compute(
        predictions=predictions,
        references=references,
        lang=self.lang,
        model_type=self.model_type,
        num_layers=self.num_layers,
        verbose=verbose,
        idf=self.idf,
        device=device,
        batch_size=batch_size,
        nthreads=nthreads,
        rescale_with_baseline=rescale_with_baseline,
        baseline_path=baseline_path,
        use_fast_tokenizer=use_fast_tokenizer,
    )
    return Score(
        value={
            "BERTScore Precision": result["precision"][0],  # type: ignore
            "BERTScore Recall": result["recall"][0],  # type: ignore
            "BERTScore F1": result["f1"][0],  # type: ignore
        },
        answer=prediction,
        metadata={
            "hashcode": result["hashcode"],  # type: ignore
        },
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score

Calculates BERTScore for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str | None

The text of the model input. Ignored for BERTScore.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any] | None

Metadata for the evaluation. Ignored for BERTScore.

None
verbose bool

Whether to turn on verbose mode.

False
device str | None

The device to use for computing the contextual embeddings.

None
batch_size int

The batch size to use for computing the contextual embeddings.

64
nthreads int

The number of threads to use for computing the contextual embeddings.

1
rescale_with_baseline bool

Whether to rescale the BERTScore with pre-computed baseline.

False
baseline_path str | None

Customized baseline file.

None
use_fast_tokenizer bool

The use_fast parameter passed to HF tokenizer.

False

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bertscore.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    verbose=False,
    device=None,
    batch_size=64,
    nthreads=1,
    rescale_with_baseline=False,
    baseline_path=None,
    use_fast_tokenizer=False,
    **kwargs: dict,
) -> Score:
    """
    Calculates BERTScore for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str | None): The text of the model input. Ignored for BERTScore.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any] | None): Metadata for the evaluation. Ignored for BERTScore.
        verbose (bool): Whether to turn on verbose mode.
        device (str | None): The device to use for computing the contextual embeddings.
        batch_size (int): The batch size to use for computing the contextual embeddings.
        nthreads (int): The number of threads to use for computing the contextual embeddings.
        rescale_with_baseline (bool): Whether to rescale the BERTScore with pre-computed baseline.
        baseline_path (str | None): Customized baseline file.
        use_fast_tokenizer (bool): The `use_fast` parameter passed to HF tokenizer.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    return self.calculate(
        prediction=prediction,
        reference=reference,
        verbose=verbose,
        device=device,
        batch_size=batch_size,
        nthreads=nthreads,
        rescale_with_baseline=rescale_with_baseline,
        baseline_path=baseline_path,
        use_fast_tokenizer=use_fast_tokenizer,
    )

BleuPrecisionScoreCalculator

Bases: ScoreCalculator

Calculator for computing BLEU scores.

Methods:

Name Description
calculate

Calculates BLEU precision scores for the supplied model prediction and reference input.

calculate_async

Calculates BLEU precision scores for the supplied model prediction and reference input.

Source code in evalsense/evaluation/evaluators/bleu.py
class BleuPrecisionScoreCalculator(ScoreCalculator):
    """Calculator for computing BLEU scores."""

    def __init__(self):
        self.bleu_module = evaluate.load("bleu")

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BLEU precision scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for BLEU.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for BLEU.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        if reference is None:
            raise ValueError(
                "Reference is required for computing BLEU precision, but was None."
            )

        predictions = [prediction]
        references = [reference]

        result = self.bleu_module.compute(
            predictions=predictions, references=references
        )
        return Score(
            value=result["precisions"][0],  # type: ignore
            answer=prediction,
            metadata={
                "prediction": prediction,
                "reference": reference,
            },
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates BLEU precision scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for BLEU.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for BLEU.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        return self.calculate(
            prediction=prediction,
            reference=reference,
            input=input,
            metadata=metadata,
            **kwargs,
        )
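
A short usage sketch, again assuming the import path; the sentences are purely illustrative.

from evalsense.evaluation.evaluators import BleuPrecisionScoreCalculator  # import path assumed

calculator = BleuPrecisionScoreCalculator()

score = calculator.calculate(
    prediction="the cat sat on the mat",
    reference="the cat is sitting on the mat",
)

# score.value holds the unigram precision reported by the Hugging Face "bleu" metric.
print(score.value)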

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates BLEU precision scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for BLEU.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for BLEU.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bleu.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates BLEU precision scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for BLEU.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for BLEU.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    if reference is None:
        raise ValueError(
            "Reference is required for computing BLEU precision, but was None."
        )

    predictions = [prediction]
    references = [reference]

    result = self.bleu_module.compute(
        predictions=predictions, references=references
    )
    return Score(
        value=result["precisions"][0],  # type: ignore
        answer=prediction,
        metadata={
            "prediction": prediction,
            "reference": reference,
        },
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates BLEU precision scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for BLEU.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for BLEU.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/bleu.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates BLEU precision scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for BLEU.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for BLEU.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    return self.calculate(
        prediction=prediction,
        reference=reference,
        input=input,
        metadata=metadata,
        **kwargs,
    )

GEvalScoreCalculator

Bases: ScoreCalculator

G-Eval score calculator.

Methods:

Name Description
__init__

Initializes the G-Eval score calculator.

calculate

This method is not supported for G-Eval and will raise an error when called.

calculate_async

Calculates the G-Eval score asynchronously.

Source code in evalsense/evaluation/evaluators/g_eval.py
class GEvalScoreCalculator(ScoreCalculator):
    """G-Eval score calculator."""

    def __init__(
        self,
        model: Model,
        prompt_template: str,
        logprobs: bool = True,
        top_logprobs: int = 20,
        min_score: int = 1,
        max_score: int = 10,
        normalise: bool = True,
        debug: bool = False,
    ):
        """
        Initializes the G-Eval score calculator.

        Args:
            model (Model): The model to use for evaluation.
            prompt_template (str): The prompt template with the scoring instructions.
            logprobs (bool): Whether to use model log probabilities to compute weighted
                evaluation score instead of a standard score.
            top_logprobs (int): The number of top log probabilities to consider.
            min_score (int): The minimum valid score.
            max_score (int): The maximum valid score.
            normalise (bool): Whether to normalise the scores between 0 and 1.
            debug (bool): Whether to report repeated errors in the log.
        """
        self.model = model
        self.prompt_template = prompt_template
        self.logprobs = logprobs
        self.top_logprobs = top_logprobs
        self.min_score = min_score
        self.max_score = max_score
        self.normalise = normalise
        self.debug = debug
        self.warned_weighted_score = False

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """This method is not supported for G-Eval and will raise an error when called.

        Use `calculate_async` instead.

        Raises:
            NotImplementedError: When called, as synchronous evaluation is not
                supported for G-Eval.
        """
        raise NotImplementedError(
            "Synchronous evaluation is not supported for G-Eval. "
            "Use calculate_async instead."
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Calculates the G-Eval score asynchronously.

        Args:
            prediction (str): The predicted output to evaluate.
            input (str | None): The input text for the model. Defaults to `None`.
            reference (str | None): The reference text for the model. Defaults to `None`.
            metadata (dict[str, Any] | None): Additional metadata for the evaluation.
                Defaults to `None`.
            **kwargs: Additional keyword arguments.

        Returns:
            Score: The calculated score.
        """
        logprobs_config = GenerateConfig(
            logprobs=self.logprobs,
            top_logprobs=self.top_logprobs,
        )
        if metadata is None:
            metadata = {}
        llm_input = format_template(
            self.prompt_template,
            prediction=prediction,
            reference=reference,
            input=input,
            **metadata,
        )
        output = await self.model.generate(llm_input, config=logprobs_config)

        raw_score = extract_score(output.completion, self.min_score, self.max_score)
        if self.logprobs:
            try:
                raw_score = extract_weighted_score(
                    output, min_score=self.min_score, max_score=self.max_score
                )
            except ValueError as e:
                if not self.warned_weighted_score or self.debug:
                    self.warned_weighted_score = True

                    error_message = (
                        f"❌  Cannot compute weighted evaluation score: {e} "
                        "Falling back to standard score."
                    )

                    if not self.debug:
                        error_message += (
                            " Further errors will be suppressed "
                            + "(set debug=True to see all errors)."
                        )
                    error_message += f" Offending output: {output.completion}"

                    logger.error(error_message)

        if self.normalise:
            score = (raw_score - self.min_score) / (self.max_score - self.min_score)
        else:
            score = raw_score

        return Score(
            value=score,
            answer=prediction,
            metadata={
                "prompt": llm_input,
                "output_text": output.completion,
                "raw_score": raw_score,
            },
        )
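
A sketch of asynchronous scoring with the calculator above. The import paths, the model identifier and the curly-brace placeholder names expected by the prompt template are assumptions; only calculate_async is usable, since calculate raises NotImplementedError.

import asyncio

from inspect_ai.model import get_model

from evalsense.evaluation.evaluators import GEvalScoreCalculator  # import path assumed

# The placeholder names and their {curly-brace} syntax are assumptions about format_template.
PROMPT_TEMPLATE = (
    "Rate the factual consistency of the summary against the source on a scale from 1 to 10. "
    "Respond with the score only.\n\nSource: {input}\n\nSummary: {prediction}"
)


async def main() -> None:
    model = get_model("openai/gpt-4o")  # any Inspect AI model identifier
    calculator = GEvalScoreCalculator(model=model, prompt_template=PROMPT_TEMPLATE)
    score = await calculator.calculate_async(
        prediction="The trial enrolled 120 patients over six months.",
        input="A six-month clinical trial was conducted with 120 participants.",
    )
    # With normalise=True (the default) the value lies between 0 and 1; the
    # unnormalised score is kept in metadata["raw_score"].
    print(score.value, score.metadata["raw_score"])


asyncio.run(main())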

__init__

__init__(
    model: Model,
    prompt_template: str,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
)

Initializes the G-Eval score calculator.

Parameters:

Name Type Description Default
model Model

The model to use for evaluation.

required
prompt_template str

The prompt template with the scoring instructions.

required
logprobs bool

Whether to use model log probabilities to compute weighted evaluation score instead of a standard score.

True
top_logprobs int

The number of top log probabilities to consider.

20
min_score int

The minimum valid score.

1
max_score int

The maximum valid score.

10
normalise bool

Whether to normalise the scores between 0 and 1.

True
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/g_eval.py
def __init__(
    self,
    model: Model,
    prompt_template: str,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
):
    """
    Initializes the G-Eval score calculator.

    Args:
        model (Model): The model to use for evaluation.
        prompt_template (str): The prompt template with the scoring instructions.
        logprobs (bool): Whether to use model log probabilities to compute weighted
            evaluation score instead of a standard score.
        top_logprobs (int): The number of top log probabilities to consider.
        min_score (int): The minimum valid score.
        max_score (int): The maximum valid score.
        normalise (bool): Whether to normalise the scores between 0 and 1.
        debug (bool): Whether to report repeated errors in the log.
    """
    self.model = model
    self.prompt_template = prompt_template
    self.logprobs = logprobs
    self.top_logprobs = top_logprobs
    self.min_score = min_score
    self.max_score = max_score
    self.normalise = normalise
    self.debug = debug
    self.warned_weighted_score = False

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

This method is not supported for G-Eval and will raise an error when called.

Use calculate_async instead.

Raises:

Type Description
NotImplementedError

When called, as synchronous evaluation is not supported for G-Eval.

Source code in evalsense/evaluation/evaluators/g_eval.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """This method is not supported for G-Eval and will raise an error when called.

    Use `calculate_async` instead.

    Raises:
        NotImplementedError: When called, as synchronous evaluation is not
            supported for G-Eval.
    """
    raise NotImplementedError(
        "Synchronous evaluation is not supported for G-Eval. "
        "Use calculate_async instead."
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates the G-Eval score asynchronously.

Parameters:

Name Type Description Default
prediction str

The predicted output to evaluate.

required
input str | None

The input text for the model. Defaults to None.

None
reference str | None

The reference text for the model. Defaults to None.

None
metadata dict[str, Any] | None

Additional metadata for the evaluation. Defaults to None.

None
**kwargs dict

Additional keyword arguments.

{}

Returns:

Name Type Description
Score Score

The calculated score.

Source code in evalsense/evaluation/evaluators/g_eval.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Calculates the G-Eval score asynchronously.

    Args:
        prediction (str): The predicted output to evaluate.
        input (str | None): The input text for the model. Defaults to `None`.
        reference (str | None): The reference text for the model. Defaults to `None`.
        metadata (dict[str, Any] | None): Additional metadata for the evaluation.
            Defaults to `None`.
        **kwargs: Additional keyword arguments.

    Returns:
        Score: The calculated score.
    """
    logprobs_config = GenerateConfig(
        logprobs=self.logprobs,
        top_logprobs=self.top_logprobs,
    )
    if metadata is None:
        metadata = {}
    llm_input = format_template(
        self.prompt_template,
        prediction=prediction,
        reference=reference,
        input=input,
        **metadata,
    )
    output = await self.model.generate(llm_input, config=logprobs_config)

    raw_score = extract_score(output.completion, self.min_score, self.max_score)
    if self.logprobs:
        try:
            raw_score = extract_weighted_score(
                output, min_score=self.min_score, max_score=self.max_score
            )
        except ValueError as e:
            if not self.warned_weighted_score or self.debug:
                self.warned_weighted_score = True

                error_message = (
                    f"❌  Cannot compute weighted evaluation score: {e} "
                    "Falling back to standard score."
                )

                if not self.debug:
                    error_message += (
                        " Further errors will be suppressed "
                        + "(set debug=True to see all errors)."
                    )
                error_message += f" Offending output: {output.completion}"

                logger.error(error_message)

    if self.normalise:
        score = (raw_score - self.min_score) / (self.max_score - self.min_score)
    else:
        score = raw_score

    return Score(
        value=score,
        answer=prediction,
        metadata={
            "prompt": llm_input,
            "output_text": output.completion,
            "raw_score": raw_score,
        },
    )

GEvalScorerFactory

Bases: ScorerFactory

Scorer factory for G-Eval.

Methods:

Name Description
__init__

Initialize the G-Eval scorer factory.

create_scorer

Creates a G-Eval scorer.

Source code in evalsense/evaluation/evaluators/g_eval.py
class GEvalScorerFactory(ScorerFactory):
    """Scorer factory for G-Eval."""

    def __init__(
        self,
        name: str,
        prompt_template: str,
        metrics: list[Metric | dict[str, list[Metric]]]
        | dict[str, list[Metric]]
        | None = None,
        logprobs: bool = True,
        top_logprobs: int = 20,
        min_score: int = 1,
        max_score: int = 10,
        normalise: bool = True,
        debug: bool = False,
    ):
        """
        Initialize the G-Eval scorer factory.

        Args:
            name (str): The name of the scorer.
            prompt_template (str): The prompt template with the scoring instructions.
            metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
                The metrics to use for the evaluation. If `None`, the default metric
                will be used (G-Eval with mean aggregation).
            logprobs (bool): Whether to use model log probabilities to compute weighted
                evaluation score instead of a standard score.
            top_logprobs (int): The number of top log probabilities to consider.
            min_score (int): The minimum valid score.
            max_score (int): The maximum valid score.
            normalise (bool): Whether to normalise the scores between 0 and 1.
            debug (bool): Whether to report repeated errors in the log.
        """
        self.name = name
        self.prompt_template = prompt_template
        if metrics is None:
            metrics = [mean()]
        self.metrics = metrics
        self.logprobs = logprobs
        self.top_logprobs = top_logprobs
        self.min_score = min_score
        self.max_score = max_score
        self.normalise = normalise
        self.debug = debug

    @override
    def create_scorer(self, model: Model) -> Scorer:
        """
        Creates a G-Eval scorer.

        Args:
            model (Model): The model to create a scorer for.

        Returns:
            Scorer: The created G-Eval scorer.
        """

        @scorer(name=self.name, metrics=self.metrics)
        def g_eval_scorer() -> Scorer:
            g_eval_calculator = GEvalScoreCalculator(
                model=model,
                prompt_template=self.prompt_template,
                logprobs=self.logprobs,
                top_logprobs=self.top_logprobs,
                min_score=self.min_score,
                max_score=self.max_score,
                normalise=self.normalise,
                debug=self.debug,
            )

            async def score(state: TaskState, target: Target):
                return await g_eval_calculator.calculate_async(
                    input=state.input_text,
                    prediction=state.output.completion,
                    reference=target.text,
                    metadata=state.metadata,
                )

            return score

        return g_eval_scorer()
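
A sketch of wiring the factory into Inspect AI, assuming the same import path and template placeholders as in the previous sketch. The returned Scorer is what an EvalSense evaluator or an Inspect AI Task would consume; here it is only created.

from inspect_ai.model import get_model

from evalsense.evaluation.evaluators import GEvalScorerFactory  # import path assumed

factory = GEvalScorerFactory(
    name="g_eval_consistency",
    prompt_template=(
        "Rate the factual consistency of the summary against the source on a scale "
        "from 1 to 10. Respond with the score only.\n\n"
        "Source: {input}\n\nSummary: {prediction}"
    ),
    min_score=1,
    max_score=10,
    normalise=True,
)

# create_scorer binds the judge model; the resulting Scorer evaluates each TaskState
# by calling GEvalScoreCalculator.calculate_async under the hood.
scorer = factory.create_scorer(get_model("openai/gpt-4o"))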

__init__

__init__(
    name: str,
    prompt_template: str,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
)

Initialize the G-Eval scorer factory.

Parameters:

Name Type Description Default
name str

The name of the scorer.

required
prompt_template str

The prompt template with the scoring instructions.

required
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, the default metric will be used (G-Eval with mean aggregation).

None
logprobs bool

Whether to use model log probabilities to compute weighted evaluation score instead of a standard score.

True
top_logprobs int

The number of top log probabilities to consider.

20
min_score int

The minimum valid score.

1
max_score int

The maximum valid score.

10
normalise bool

Whether to normalise the scores between 0 and 1.

True
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/g_eval.py
def __init__(
    self,
    name: str,
    prompt_template: str,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
):
    """
    Initialize the G-Eval scorer factory.

    Args:
        name (str): The name of the scorer.
        prompt_template (str): The prompt template with the scoring instructions.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metric
            will be used (G-Eval with mean aggregation).
        logprobs (bool): Whether to use model log probabilities to compute weighted
            evaluation score instead of a standard score.
        top_logprobs (int): The number of top log probabilities to consider.
        min_score (int): The minimum valid score.
        max_score (int): The maximum valid score.
        normalise (bool): Whether to normalise the scores between 0 and 1.
        debug (bool): Whether to report repeated errors in the log.
    """
    self.name = name
    self.prompt_template = prompt_template
    if metrics is None:
        metrics = [mean()]
    self.metrics = metrics
    self.logprobs = logprobs
    self.top_logprobs = top_logprobs
    self.min_score = min_score
    self.max_score = max_score
    self.normalise = normalise
    self.debug = debug

create_scorer

create_scorer(model: Model) -> Scorer

Creates a G-Eval scorer.

Parameters:

Name Type Description Default
model Model

The model to create a scorer for.

required

Returns:

Name Type Description
Scorer Scorer

The created G-Eval scorer.

Source code in evalsense/evaluation/evaluators/g_eval.py
@override
def create_scorer(self, model: Model) -> Scorer:
    """
    Creates a G-Eval scorer.

    Args:
        model (Model): The model to create a scorer for.

    Returns:
        Scorer: The created G-Eval scorer.
    """

    @scorer(name=self.name, metrics=self.metrics)
    def g_eval_scorer() -> Scorer:
        g_eval_calculator = GEvalScoreCalculator(
            model=model,
            prompt_template=self.prompt_template,
            logprobs=self.logprobs,
            top_logprobs=self.top_logprobs,
            min_score=self.min_score,
            max_score=self.max_score,
            normalise=self.normalise,
            debug=self.debug,
        )

        async def score(state: TaskState, target: Target):
            return await g_eval_calculator.calculate_async(
                input=state.input_text,
                prediction=state.output.completion,
                reference=target.text,
                metadata=state.metadata,
            )

        return score

    return g_eval_scorer()

QagsConfig

Bases: Protocol

A protocol for configuring QAGS evaluation.

Methods:

Name Description
__init__

Initializes the QAGS configuration.

enforce_not_none

Helper method to enforce that a parameter is not None.

get_answer_comparison_prompt

Constructs the prompt for comparing answers to the generated questions.

get_answer_generation_prompt

Constructs the prompt for generating the answer to a single question.

get_question_generation_prompt

Constructs the prompt for generating the questions for the model output.

Source code in evalsense/evaluation/evaluators/qags.py
class QagsConfig(Protocol):
    """A protocol for configuring QAGS evaluation."""

    answer_comparison_mode: Literal["ternary", "exact", "judge"]
    logprobs: bool
    top_logprobs: int
    ci: float
    debug: bool

    def __init__(
        self,
        answer_comparison_mode: Literal["ternary", "exact", "judge"],
        logprobs: bool = True,
        top_logprobs: int = 20,
        ci: float = 0.1,
        debug: bool = False,
    ):
        """
        Initializes the QAGS configuration.

        Args:
            answer_comparison_mode (Literal["ternary", "exact", "judge"]): The mode
                for comparing answers. Either "ternary", "exact", or "judge".
                In "ternary" mode, the model is expected to answer the generated
                questions with "yes", "no", or "unknown". In other modes, the model
                may give arbitrary answers, which are either compared in terms
                of exact match or compared by the model itself.
            logprobs (bool): Whether to use logprobs to compute weighted answers. Can only
                be used when `answer_comparison_mode` is set to "judge".
            top_logprobs (int): The number of top log probabilities to consider
                when computing weighted answers.
            ci (float): The range near the extreme values (0.0 or 1.0) in which
                to consider the model answer as confident when comparing answers.
                This only affects the score explanation when `answer_comparison_mode`
                is set to "judge". The default value is 0.1, which means that
                answers with a score of 0.9 or higher are confident "yes", while answers
                with a score of 0.1 or lower are confident "no".
            debug (bool): Whether to report repeated errors in the log.
        """
        self.answer_comparison_mode = answer_comparison_mode
        self.logprobs = logprobs
        self.top_logprobs = top_logprobs
        self.ci = ci
        self.debug = debug

    def enforce_not_none[T](self, param_name: str, param_value: T | None) -> T:
        """
        Helper method to enforce that a parameter is not None.

        Args:
            param_name (str): The name of the parameter.
            param_value (T | None): The value of the parameter.

        Raises:
            ValueError: If the parameter value is None.

        Returns:
            T: The parameter value if it is not None.
        """
        if param_value is None:
            raise ValueError(f"{param_name} cannot be None.")
        return param_value

    @abstractmethod
    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        """
        Constructs the prompt for generating the questions for the model output.

        The prompt should instruct the model to generate each question on
        a separate line.

        Args:
            source (Literal["prediction", "reference"]): The source to use for
                generating the questions. Either "prediction" or "reference".
                According to the source, the generated prompt should either use
                the model output or the reference output/input. When
                `answer_comparison_mode` is set to "ternary", the generated
                questions should be answerable with "yes", "no", or "unknown".
            prediction (str, optional): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            str: The generated prompt.
        """
        ...

    @abstractmethod
    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        """
        Constructs the prompt for generating the answer to a single question.

        Args:
            source (Literal["prediction", "reference"]): The source to use for
                generating the answer. Either "prediction" or "reference".
                According to the source, the generated prompt should either use
                the model output or the reference output/input when asking
                the model to answer the question. When `answer_comparison_mode`
                is set to "ternary", the prompt should instruct the model to
                answer the question with "yes", "no", or "unknown". Otherwise,
                the model should be instructed to give an answer only without
                any further comments.
            prediction (str, optional): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            str: The generated prompt.
        """
        ...

    def get_answer_comparison_prompt(
        self,
        *,
        question: str,
        prediction_answer: str,
        reference_answer: str,
        input: str | None = None,
        prediction: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        """
        Constructs the prompt for comparing answers to the generated questions.

        This method is only used when `answer_comparison_mode` is set to "judge".

        Args:
            question (str): The question to compare answers for.
            prediction_answer (str): The answer generated from the model output.
            reference_answer (str): The answer generated from the reference output.
            input (str | None, optional): The input to the model. Optional.
            prediction (str | None, optional): The model output to evaluate. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            str: The generated prompt.
        """
        if self.answer_comparison_mode == "judge":
            raise NotImplementedError(
                "Answer comparison prompt generation is not implemented. "
                "If you want to use QAGS in judge mode, please implement this method."
            )
        assert False, (
            "Should not attempt to generate comparison prompt in non-judge mode."
        )
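
Because QagsConfig is a protocol with abstract prompt-construction methods, using it means subclassing it. The sketch below is a hypothetical ternary-mode configuration for summarisation; the import path and the prompt wording are assumptions, not part of the library.

from typing import Any, Literal

from evalsense.evaluation.evaluators import QagsConfig  # import path assumed


class SummaryQagsConfig(QagsConfig):
    """Illustrative ternary-mode QAGS configuration for summarisation."""

    def __init__(self) -> None:
        # In "ternary" mode answers are yes/no/unknown, so no judge comparison prompt is needed.
        super().__init__(answer_comparison_mode="ternary", logprobs=False)

    def get_question_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        text = (
            prediction
            if source == "prediction"
            else self.enforce_not_none("reference", reference)
        )
        return (
            "Write yes/no questions that check the facts stated in the text below, "
            "one question per line.\n\n" + text
        )

    def get_answer_generation_prompt(
        self,
        *,
        source: Literal["prediction", "reference"],
        question: str,
        prediction: str | None = None,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> str:
        text = (
            self.enforce_not_none("prediction", prediction)
            if source == "prediction"
            else self.enforce_not_none("reference", reference)
        )
        return (
            "Answer the question with yes, no or unknown, based only on the text below.\n\n"
            f"Text: {text}\n\nQuestion: {question}"
        )

An instance of such a configuration would typically be passed to the QAGS scorer factory or evaluator constructor (QagsScorerFactory / get_qags_evaluator) listed at the top of this page.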

__init__

__init__(
    answer_comparison_mode: Literal[
        "ternary", "exact", "judge"
    ],
    logprobs: bool = True,
    top_logprobs: int = 20,
    ci: float = 0.1,
    debug: bool = False,
)

Initializes the QAGS configuration.

Parameters:

Name Type Description Default
answer_comparison_mode Literal['ternary', 'exact', 'judge']

The mode for comparing answers. Either "ternary", "exact", or "judge". In "ternary" mode, the model is expected to answer the generated questions with "yes", "no", or "unknown". In other modes, the model may give arbitrary answers, which are either compared in terms of exact match or compared by the model itself.

required
logprobs bool

Whether to use logprobs to compute weighted answers. Can only be used when answer_comparison_mode is set to "judge".

True
top_logprobs int

The number of top log probabilities to consider when computing weighted answers.

20
ci float

The range near the extreme values (0.0 or 1.0) in which to consider the model answer as confident when comparing answers. This only affects the score explanation when answer_comparison_mode is set to "judge". The default value is 0.1, which means that answers with a score of 0.9 or higher are confident "yes", while answers with a score of 0.1 or lower are confident "no".

0.1
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/qags.py
def __init__(
    self,
    answer_comparison_mode: Literal["ternary", "exact", "judge"],
    logprobs: bool = True,
    top_logprobs: int = 20,
    ci: float = 0.1,
    debug: bool = False,
):
    """
    Initializes the QAGS configuration.

    Args:
        answer_comparison_mode (Literal["ternary", "exact", "judge"]): The mode
            for comparing answers. Either "ternary", "exact", or "judge".
            In "ternary" mode, the model is expected to answer the generated
            questions with "yes", "no", or "unknown". In other modes, the model
            may give arbitrary answers, which are either compared in terms
            of exact match or compared by the model itself.
        logprobs (bool): Whether to use logprobs to compute weighted answers. Can only
            be used when `answer_comparison_mode` is set to "judge".
        top_logprobs (int): The number of top log probabilities to consider
            when computing weighted answers.
        ci (float): The range near the extreme values (0.0 or 1.0) in which
            to consider the model answer as confident when comparing answers.
            This only affects the score explanation when `answer_comparison_mode`
            is set to "judge". The default value is 0.1, which means that
            answers with a score of 0.9 or higher are confident "yes", while answers
            with a score of 0.1 or lower are confident "no".
        debug (bool): Whether to report repeated errors in the log.
    """
    self.answer_comparison_mode = answer_comparison_mode
    self.logprobs = logprobs
    self.top_logprobs = top_logprobs
    self.ci = ci
    self.debug = debug

enforce_not_none

enforce_not_none(
    param_name: str, param_value: T | None
) -> T

Helper method to enforce that a parameter is not None.

Parameters:

Name Type Description Default
param_name str

The name of the parameter.

required
param_value T | None

The value of the parameter.

required

Raises:

Type Description
ValueError

If the parameter value is None.

Returns:

Name Type Description
T T

The parameter value if it is not None.

Source code in evalsense/evaluation/evaluators/qags.py
def enforce_not_none[T](self, param_name: str, param_value: T | None) -> T:
    """
    Helper method to enforce that a parameter is not None.

    Args:
        param_name (str): The name of the parameter.
        param_value (T | None): The value of the parameter.

    Raises:
        ValueError: If the parameter value is None.

    Returns:
        T: The parameter value if it is not None.
    """
    if param_value is None:
        raise ValueError(f"{param_name} cannot be None.")
    return param_value

get_answer_comparison_prompt

get_answer_comparison_prompt(
    *,
    question: str,
    prediction_answer: str,
    reference_answer: str,
    input: str | None = None,
    prediction: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str

Constructs the prompt for comparing answers to the generated questions.

This method is only used when answer_comparison_mode is set to "judge".

Parameters:

Name Type Description Default
question str

The question to compare answers for.

required
prediction_answer str

The answer generated from the model output.

required
reference_answer str

The answer generated from the reference output.

required
input str | None

The input to the model. Optional.

None
prediction str | None

The model output to evaluate. Optional.

None
reference str | None

The reference output to compare against. Optional.

None
metadata dict[str, Any] | None

Additional Inspect AI sample/task state metadata. Optional.

None

Returns:

Name Type Description
str str

The generated prompt.

Source code in evalsense/evaluation/evaluators/qags.py
def get_answer_comparison_prompt(
    self,
    *,
    question: str,
    prediction_answer: str,
    reference_answer: str,
    input: str | None = None,
    prediction: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    """
    Constructs the prompt for comparing answers to the generated questions.

    This method is only used when `answer_comparison_mode` is set to "judge".

    Args:
        question (str): The question to compare answers for.
        prediction_answer (str): The answer generated from the model output.
        reference_answer (str): The answer generated from the reference output.
        input (str | None, optional): The input to the model. Optional.
        prediction (str | None, optional): The model output to evaluate. Optional.
        reference (str | None, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
            state metadata. Optional.

    Returns:
        str: The generated prompt.
    """
    if self.answer_comparison_mode == "judge":
        raise NotImplementedError(
            "Answer comparison prompt generation is not implemented. "
            "If you want to use QAGS in judge mode, please implement this method."
        )
    assert False, (
        "Should not attempt to generate comparison prompt in non-judge mode."
    )

get_answer_generation_prompt abstractmethod

get_answer_generation_prompt(
    *,
    source: Literal["prediction", "reference"],
    question: str,
    prediction: str | None = None,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str

Constructs the prompt for generating the answer to a single question.

Parameters:

Name Type Description Default
source Literal['prediction', 'reference']

The source to use for generating the answer. Either "prediction" or "reference". According to the source, the generated prompt should either use the model output or the reference output/input when asking the model to answer the question. When answer_comparison_mode is set to "ternary", the prompt should instruct the model to answer the question with "yes", "no", or "unknown". Otherwise, the model should be instructed to give an answer only without any further comments.

required
question str

The question to generate the answer for.

required
prediction str

The model output to evaluate.

None
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None

Returns:

Name Type Description
str str

The generated prompt.

Source code in evalsense/evaluation/evaluators/qags.py
@abstractmethod
def get_answer_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    question: str,
    prediction: str | None = None,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    """
    Constructs the prompt for generating the answer to a single question.

    Args:
        source (Literal["prediction", "reference"]): The source to use for
            generating the answer. Either "prediction" or "reference".
            According to the source, the generated prompt should either use
            the model output or the reference output/input when asking
            the model to answer the question. When `answer_comparison_mode`
            is set to "ternary", the prompt should instruct the model to
            answer the question with "yes", "no", or "unknown". Otherwise,
            the model should be instructed to give an answer only without
            any further comments.
        question (str): The question to generate the answer for.
        prediction (str, optional): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.

    Returns:
        str: The generated prompt.
    """
    ...
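
As a concrete illustration, a custom QagsConfig subclass might implement this method roughly as follows (ternary mode); the prompt wording and the choice of context are assumptions, not part of EvalSense.

# Hypothetical implementation in a custom QagsConfig subclass (ternary mode).
# Requires: from typing import Any, Literal
def get_answer_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    question: str,
    prediction: str | None = None,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    # Answer from the model output or from the reference/input, depending on source.
    context = prediction if source == "prediction" else (reference or input)
    return (
        "Answer the question using only the text below.\n"
        f"Text: {context}\n"
        f"Question: {question}\n"
        "Answer with a single word: yes, no, or unknown."
    )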

get_question_generation_prompt abstractmethod

get_question_generation_prompt(
    *,
    source: Literal["prediction", "reference"],
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str

Constructs the prompt for generating the questions for the model output.

The prompt should instruct the model to generate each question on a separate line.

Parameters:

Name Type Description Default
source Literal['prediction', 'reference']

The source to use for generating the questions. Either "prediction" or "reference". According to the source, the generated prompt should either use the model output or the reference output/input. When answer_comparison_mode is set to "ternary", the generated questions should be answerable with "yes", "no", or "unknown".

required
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None

Returns:

Name Type Description
str str

The generated prompt.

Source code in evalsense/evaluation/evaluators/qags.py
@abstractmethod
def get_question_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    """
    Constructs the prompt for generating the questions for the model output.

    The prompt should instruct the model to generate each question on
    a separate line.

    Args:
        source (Literal["prediction", "reference"]): The source to use for
            generating the questions. Either "prediction" or "reference".
            According to the source, the generated prompt should either use
            the model output or the reference output/input. When
            `answer_comparison_mode` is set to "ternary", the generated
            questions should be answerable with "yes", "no", or "unknown".
        prediction (str, optional): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.

    Returns:
        str: The generated prompt.
    """
    ...
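
A matching sketch of this method for a custom QagsConfig subclass in ternary mode (again, the prompt wording is an illustrative assumption). Note that the prompt asks for one question per line ending with a question mark, which is the format QagsScoreCalculator later extracts.

# Hypothetical implementation in a custom QagsConfig subclass (ternary mode).
# Requires: from typing import Any, Literal
def get_question_generation_prompt(
    self,
    *,
    source: Literal["prediction", "reference"],
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> str:
    # Generate questions about the model output or the reference/input.
    text = prediction if source == "prediction" else (reference or input)
    return (
        "Read the text below and write questions that check its factual content.\n"
        f"Text: {text}\n"
        "Write one question per line. Each question must end with a question mark "
        "and must be answerable with yes, no, or unknown."
    )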

QagsScoreCalculator

Bases: ScoreCalculator

QAGS score calculator.

Methods:

Name Description
__init__

Initializes the QAGS score calculator.

calculate

This method is not supported for QAGS and will raise an error when called.

calculate_async

Asynchronously computes evaluation scores for QAGS.

Attributes:

Name Type Description
generate_config GenerateConfig

Generation configuration for the model.

Source code in evalsense/evaluation/evaluators/qags.py
class QagsScoreCalculator(ScoreCalculator):
    """QAGS score calculator."""

    _symbol_dict: dict[bool | None, str] = {
        True: "✅",
        False: "❌",
        None: "❓",
    }

    def __init__(
        self,
        model: Model,
        config: QagsConfig,
        name: str = "QAGS",
        debug: bool = False,
    ):
        """
        Initializes the QAGS score calculator.

        Args:
            model (Model): The model to use for evaluation.
            config (QagsConfig): The configuration for the QAGS score calculator.
            name (str): The name of the score calculator. Defaults to "QAGS".
            debug (bool): Whether to report repeated errors in the log.
        """
        self.model = model
        self.config = config
        self.name = name
        self.warned_weighted_answer = False

    @property
    def generate_config(self) -> GenerateConfig:
        """Generation configuration for the model."""
        if self.config.logprobs and self.config.answer_comparison_mode == "judge":
            return GenerateConfig(
                logprobs=self.config.logprobs,
                top_logprobs=self.config.top_logprobs,
            )
        return GenerateConfig()

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """This method is not supported for QAGS and will raise an error when called.

        Use `calculate_async` instead.

        Raises:
            NotImplementedError: When called, as synchronous evaluation is not
                supported for QAGS.
        """
        raise NotImplementedError(
            "Synchronous evaluation is not supported for QAGS. "
            "Use calculate_async instead."
        )

    async def _generate_questions(
        self,
        *,
        prediction: str,
        score_metadata: dict[str, Any],
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> list[str]:
        """Generates questions for the model output and reference output.

        Args:
            prediction (str): The model output to evaluate.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.
            input (str | None, optional): The input to the model. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            list[str]: A list of generated questions.
        """
        # Questions for model outputs
        prediction_questions_prompt = self.config.get_question_generation_prompt(
            source="prediction",
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
        )
        # We don't actually need the logprobs until comparing the answers,
        # but the vLLM provider uses the config from the first sample in the batch
        # so we need to use consistent config for all samples.
        prediction_questions_output = await self.model.generate(
            prediction_questions_prompt, config=self.generate_config
        )
        prediction_questions = extract_lines(
            prediction_questions_output.completion,
            include_filter_fun=lambda line: line.endswith("?"),
        )

        # Questions for reference outputs
        reference_questions_prompt = self.config.get_question_generation_prompt(
            source="reference",
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
        )
        reference_questions_output = await self.model.generate(
            reference_questions_prompt, config=self.generate_config
        )
        reference_questions = extract_lines(
            reference_questions_output.completion,
            include_filter_fun=lambda line: line.endswith("?"),
        )

        questions = prediction_questions + reference_questions

        score_metadata["questions"] = questions
        score_metadata["prediction_questions_prompt"] = prediction_questions_prompt
        score_metadata["reference_questions_prompt"] = reference_questions_prompt
        score_metadata["raw_prediction_questions"] = (
            prediction_questions_output.completion
        )
        score_metadata["raw_reference_questions"] = (
            reference_questions_output.completion
        )

        return questions

    async def _generate_answers(
        self,
        *,
        prediction: str,
        score_metadata: dict[str, Any],
        questions: list[str],
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> tuple[list[str], list[str]]:
        """Generates answers for the model output and reference output.

        Args:
            prediction (str): The model output to evaluate.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.
            questions (list[str]): The list of questions to generate answers for.
            input (str | None, optional): The input to the model. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            tuple[list[str], list[str]]: A tuple containing two lists of generated
                answers - one for the model output and one for the reference output,
                respectively.
        """
        prediction_answers: list[str] = []
        reference_answers: list[str] = []
        score_metadata["raw_prediction_answers"] = []
        score_metadata["raw_reference_answers"] = []
        score_metadata["prediction_answer_prompts"] = []
        score_metadata["reference_answer_prompts"] = []
        for question in questions:
            prediction_answer_prompt = self.config.get_answer_generation_prompt(
                source="prediction",
                question=question,
                prediction=prediction,
                input=input,
                reference=reference,
                metadata=metadata,
            )
            prediction_answer_output = await self.model.generate(
                prediction_answer_prompt, config=self.generate_config
            )
            prediction_answers.append(prediction_answer_output.completion)

            reference_answer_prompt = self.config.get_answer_generation_prompt(
                source="reference",
                question=question,
                prediction=prediction,
                input=input,
                reference=reference,
                metadata=metadata,
            )
            reference_answer_output = await self.model.generate(
                reference_answer_prompt, config=self.generate_config
            )
            reference_answers.append(reference_answer_output.completion)

            score_metadata["raw_prediction_answers"].append(
                prediction_answer_output.completion
            )
            score_metadata["raw_reference_answers"].append(
                reference_answer_output.completion
            )
            score_metadata["prediction_answer_prompts"].append(prediction_answer_prompt)
            score_metadata["reference_answer_prompts"].append(reference_answer_prompt)

        return prediction_answers, reference_answers

    def _evaluate_ternary_answers(
        self,
        *,
        prediction: str,
        questions: list[str],
        raw_prediction_answers: list[str],
        raw_reference_answers: list[str],
        score_metadata: dict[str, Any],
    ) -> Score:
        """Evaluates the answers using the ternary answer comparison mode.

        Args:
            prediction (str): The model output to evaluate.
            questions (list[str]): The list of questions generated for the model
                output.
            raw_prediction_answers (list[str]): The list of answers generated from
                the model output.
            raw_reference_answers (list[str]): The list of answers generated from
                the reference output.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        prediction_answers = [
            extract_ternary_answer(answer, binary_only=False, unknown_on_mismatch=True)
            for answer in raw_prediction_answers
        ]
        reference_answers = [
            extract_ternary_answer(answer, binary_only=False, unknown_on_mismatch=True)
            for answer in raw_reference_answers
        ]

        ref_positive = sum([ra is True for ra in reference_answers])
        pred_positive = sum([pa is True for pa in prediction_answers])
        true_positive = sum(
            [
                pa == ra and ra is True
                for pa, ra in zip(prediction_answers, reference_answers)
            ]
        )
        total_correct = sum(
            [pa == ra for pa, ra in zip(prediction_answers, reference_answers)]
        )

        coverage = true_positive / ref_positive if ref_positive > 0 else 0.0
        groundedness = true_positive / pred_positive if pred_positive > 0 else 0.0
        accuracy = (
            total_correct / len(prediction_answers)
            if len(prediction_answers) > 0
            else 0.0
        )

        explanation = "QAGS Evaluation Report\n\n\nMismatched Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            if pa == ra:
                continue
            explanation += (
                f"* [{i}] Q: {question}, PA: {self._symbol_dict.get(pa)}, "
                f"RA: {self._symbol_dict.get(ra)}, "
                f"Match: {self._symbol_dict.get(False)}\n"
            )
        explanation += "\n\nAll Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            explanation += (
                f"* [{i}] Q: {question}, PA: {self._symbol_dict.get(pa)}, "
                f"RA: {self._symbol_dict.get(ra)}, "
                f"Match: {self._symbol_dict.get(pa == ra)}\n"
            )
        explanation += (
            "\n\n"
            + f"Coverage: {coverage:.2f} ({true_positive}/{ref_positive})\n"
            + f"Groundedness: {groundedness:.2f} ({true_positive}/{pred_positive})\n"
            + f"Accuracy: {accuracy:.2f} ({total_correct}/{len(prediction_answers)})"
        )

        score_metadata["prediction_answers"] = prediction_answers
        score_metadata["reference_answers"] = reference_answers
        score_metadata = {"explanation": explanation} | score_metadata

        return Score(
            value={
                f"{self.name} Coverage": coverage,
                f"{self.name} Groundedness": groundedness,
                f"{self.name} Accuracy": accuracy,
            },
            answer=prediction,
            explanation=explanation,
            metadata=score_metadata,
        )

    def _evaluate_exact_answers(
        self,
        *,
        prediction: str,
        questions: list[str],
        raw_prediction_answers: list[str],
        raw_reference_answers: list[str],
        score_metadata: dict[str, Any],
    ) -> Score:
        """Evaluates the answers using the exact answer comparison mode.

        Args:
            prediction (str): The model output to evaluate.
            questions (list[str]): The list of questions generated for the model
                output.
            raw_prediction_answers (list[str]): The list of answers generated from
                the model output.
            raw_reference_answers (list[str]): The list of answers generated from
                the reference output.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        prediction_answers = [pa.strip().lower() for pa in raw_prediction_answers]
        reference_answers = [ra.strip().lower() for ra in raw_reference_answers]
        total_correct = sum(
            [pa == ra for pa, ra in zip(prediction_answers, reference_answers)]
        )
        accuracy = (
            total_correct / len(prediction_answers)
            if len(prediction_answers) > 0
            else 0.0
        )

        explanation = "QAGS Evaluation Report\n\n\nMismatched Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            if pa == ra:
                continue
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, "
                f"Match: {self._symbol_dict.get(False)}\n"
            )
        explanation += "\n\nAll Q&As\n"
        for i, (question, pa, ra) in enumerate(
            zip(questions, prediction_answers, reference_answers)
        ):
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, "
                f"Match: {self._symbol_dict.get(pa == ra)}\n"
            )
        explanation += (
            f"\n\nAccuracy: {accuracy:.2f} ({total_correct}/{len(prediction_answers)})"
        )

        score_metadata["prediction_answers"] = prediction_answers
        score_metadata["reference_answers"] = reference_answers
        score_metadata = {"explanation": explanation} | score_metadata

        return Score(
            value=accuracy,
            answer=prediction,
            explanation=explanation,
            metadata=score_metadata,
        )

    async def _evaluate_judge_answers(
        self,
        prediction: str,
        questions: list[str],
        raw_prediction_answers: list[str],
        raw_reference_answers: list[str],
        score_metadata: dict[str, Any],
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> Score:
        """Evaluates the answers using the judge answer comparison mode.

        Args:
            prediction (str): The model output to evaluate.
            questions (list[str]): The list of questions generated for the model
                output.
            raw_prediction_answers (list[str]): The list of answers generated from
                the model output.
            raw_reference_answers (list[str]): The list of answers generated from
                the reference output.
            score_metadata (dict[str, Any]): The dictionary for storing metadata
                associated with the evaluation, returned with the score.
            input (str | None, optional): The input to the model. Optional.
            reference (str | None, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any] | None, optional): Additional Inspect AI sample/task
                state metadata. Optional.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """
        answer_comparisons: list[float] = []
        for question, prediction_answer, reference_answer in zip(
            questions,
            raw_prediction_answers,
            raw_reference_answers,
        ):
            answer_comparison_prompt = self.config.get_answer_comparison_prompt(
                question=question,
                prediction_answer=prediction_answer,
                reference_answer=reference_answer,
                input=input,
                prediction=prediction,
                reference=reference,
                metadata=metadata,
            )
            answer_comparison_output = await self.model.generate(
                answer_comparison_prompt, config=self.generate_config
            )
            answer_comparison = float(
                extract_ternary_answer(
                    answer_comparison_output.completion,
                    binary_only=True,
                    unknown_on_mismatch=False,
                )
            )
            if self.config.logprobs:
                try:
                    answer_comparison = extract_weighted_binary_answer(
                        answer_comparison_output
                    )
                except ValueError as e:
                    if not self.warned_weighted_answer or self.config.debug:
                        self.warned_weighted_answer = True

                        error_message = (
                            f"❌  Cannot compute weighted comparison score: {e} "
                            "Falling back to binary comparison."
                        )

                        if not self.config.debug:
                            error_message += (
                                " Further errors will be suppressed "
                                + "(set debug=True to see all errors)."
                            )

                        logger.error(error_message)
            answer_comparisons.append(answer_comparison)

        def to_match_symbol(answer_comparison: float) -> str:
            if answer_comparison > 1 - self.config.ci:
                return self._symbol_dict[True]
            elif answer_comparison < self.config.ci:
                return self._symbol_dict[False]
            else:
                return self._symbol_dict[None]

        accuracy = sum(answer_comparisons) / len(answer_comparisons)

        explanation = "QAGS Evaluation Report\n\n\nMismatched Q&As\n"
        for i, (question, pa, ra, ac) in enumerate(
            zip(
                questions,
                raw_prediction_answers,
                raw_reference_answers,
                answer_comparisons,
            )
        ):
            if ac > 1 - self.config.ci:
                continue
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, Score: {ac:.2f}, "
                f"Match: {to_match_symbol(ac)}\n"
            )
        explanation += "\n\nAll Q&As\n"
        for i, (question, pa, ra, ac) in enumerate(
            zip(
                questions,
                raw_prediction_answers,
                raw_reference_answers,
                answer_comparisons,
            )
        ):
            explanation += (
                f"* [{i}] Q: {question}, PA: {pa}, RA: {ra}, Score: {ac:.2f}, "
                f"Match: {to_match_symbol(ac)}\n"
            )
        explanation += (
            f"\n\nAccuracy: {accuracy:.2f} "
            + f"({sum(answer_comparisons):.2f}/{len(answer_comparisons)})"
        )

        score_metadata["prediction_answers"] = raw_prediction_answers
        score_metadata["reference_answers"] = raw_reference_answers
        score_metadata["answer_comparisons"] = answer_comparisons
        score_metadata = {"explanation": explanation} | score_metadata

        return Score(
            value=accuracy,
            answer=prediction,
            explanation=explanation,
            metadata=score_metadata,
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """Asynchronously computes evaluation scores for QAGS.

        Args:
            prediction (str): The model output to evaluate.
            input (str, optional): The input to the model. Optional.
            reference (str, optional): The reference output to compare against.
                Optional.
            metadata (dict[str, Any], optional): Additional Inspect AI sample/task
                state metadata. Optional.
            **kwargs (dict): Additional keyword arguments specific to the given
                evaluation method.

        Returns:
            Score: The Inspect AI Score object with the calculated result.
        """

        score_metadata = {}

        all_questions = await self._generate_questions(
            prediction=prediction,
            score_metadata=score_metadata,
            input=input,
            reference=reference,
            metadata=metadata,
        )

        prediction_answers, reference_answers = await self._generate_answers(
            prediction=prediction,
            score_metadata=score_metadata,
            questions=all_questions,
            input=input,
            reference=reference,
            metadata=metadata,
        )

        match self.config.answer_comparison_mode:
            case "ternary":
                return self._evaluate_ternary_answers(
                    prediction=prediction,
                    questions=all_questions,
                    raw_prediction_answers=prediction_answers,
                    raw_reference_answers=reference_answers,
                    score_metadata=score_metadata,
                )
            case "exact":
                return self._evaluate_exact_answers(
                    prediction=prediction,
                    questions=all_questions,
                    raw_prediction_answers=prediction_answers,
                    raw_reference_answers=reference_answers,
                    score_metadata=score_metadata,
                )
            case "judge":
                return await self._evaluate_judge_answers(
                    prediction=prediction,
                    questions=all_questions,
                    raw_prediction_answers=prediction_answers,
                    raw_reference_answers=reference_answers,
                    score_metadata=score_metadata,
                    input=input,
                    reference=reference,
                    metadata=metadata,
                )
            case _:
                raise ValueError(
                    f"Invalid answer comparison mode: {self.config.answer_comparison_mode}. "
                    "Expected one of 'ternary', 'exact', 'judge'."
                )

generate_config property

generate_config: GenerateConfig

Generation configuration for the model.

__init__

__init__(
    model: Model,
    config: QagsConfig,
    name: str = "QAGS",
    debug: bool = False,
)

Initializes the QAGS score calculator.

Parameters:

Name Type Description Default
model Model

The model to use for evaluation.

required
config QagsConfig

The configuration for the QAGS score calculator.

required
name str

The name of the score calculator. Defaults to "QAGS".

'QAGS'
debug bool

Whether to report repeated errors in the log.

False
Source code in evalsense/evaluation/evaluators/qags.py
def __init__(
    self,
    model: Model,
    config: QagsConfig,
    name: str = "QAGS",
    debug: bool = False,
):
    """
    Initializes the QAGS score calculator.

    Args:
        model (Model): The model to use for evaluation.
        config (QagsConfig): The configuration for the QAGS score calculator.
        name (str): The name of the score calculator. Defaults to "QAGS".
        debug (bool): Whether to report repeated errors in the log.
    """
    self.model = model
    self.config = config
    self.name = name
    self.warned_weighted_answer = False

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

This method is not supported for QAGS and will raise an error when called.

Use calculate_async instead.

Raises:

Type Description
NotImplementedError

When called, as synchronous evaluation is not supported for QAGS.

Source code in evalsense/evaluation/evaluators/qags.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """This method is not supported for QAGS and will raise an error when called.

    Use `calculate_async` instead.

    Raises:
        NotImplementedError: When called, as synchronous evaluation is not
            supported for QAGS.
    """
    raise NotImplementedError(
        "Synchronous evaluation is not supported for QAGS. "
        "Use calculate_async instead."
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Asynchronously computes evaluation scores for QAGS.

Parameters:

Name Type Description Default
prediction str

The model output to evaluate.

required
input str

The input to the model. Optional.

None
reference str

The reference output to compare against. Optional.

None
metadata dict[str, Any]

Additional Inspect AI sample/task state metadata. Optional.

None
**kwargs dict

Additional keyword arguments specific to the given evaluation method.

{}

Returns:

Name Type Description
Score Score

The Inspect AI Score object with the calculated result.

Source code in evalsense/evaluation/evaluators/qags.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """Asynchronously computes evaluation scores for QAGS.

    Args:
        prediction (str): The model output to evaluate.
        input (str, optional): The input to the model. Optional.
        reference (str, optional): The reference output to compare against.
            Optional.
        metadata (dict[str, Any], optional): Additional Inspect AI sample/task
            state metadata. Optional.
        **kwargs (dict): Additional keyword arguments specific to the given
            evaluation method.

    Returns:
        Score: The Inspect AI Score object with the calculated result.
    """

    score_metadata = {}

    all_questions = await self._generate_questions(
        prediction=prediction,
        score_metadata=score_metadata,
        input=input,
        reference=reference,
        metadata=metadata,
    )

    prediction_answers, reference_answers = await self._generate_answers(
        prediction=prediction,
        score_metadata=score_metadata,
        questions=all_questions,
        input=input,
        reference=reference,
        metadata=metadata,
    )

    match self.config.answer_comparison_mode:
        case "ternary":
            return self._evaluate_ternary_answers(
                prediction=prediction,
                questions=all_questions,
                raw_prediction_answers=prediction_answers,
                raw_reference_answers=reference_answers,
                score_metadata=score_metadata,
            )
        case "exact":
            return self._evaluate_exact_answers(
                prediction=prediction,
                questions=all_questions,
                raw_prediction_answers=prediction_answers,
                raw_reference_answers=reference_answers,
                score_metadata=score_metadata,
            )
        case "judge":
            return await self._evaluate_judge_answers(
                prediction=prediction,
                questions=all_questions,
                raw_prediction_answers=prediction_answers,
                raw_reference_answers=reference_answers,
                score_metadata=score_metadata,
                input=input,
                reference=reference,
                metadata=metadata,
            )
        case _:
            raise ValueError(
                f"Invalid answer comparison mode: {self.config.answer_comparison_mode}. "
                "Expected one of 'ternary', 'exact', 'judge'."
            )
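
For orientation, the ternary comparison mode reduces to simple counting over the extracted yes/no/unknown answers, as implemented in _evaluate_ternary_answers above. A small worked example with hypothetical answer lists:

# Hypothetical worked example of the ternary QAGS metrics (not library code).
prediction_answers = [True, True, False, None]   # answers derived from the model output
reference_answers = [True, False, False, True]   # answers derived from the reference

ref_positive = sum(ra is True for ra in reference_answers)    # 2
pred_positive = sum(pa is True for pa in prediction_answers)  # 2
true_positive = sum(
    pa == ra and ra is True
    for pa, ra in zip(prediction_answers, reference_answers)
)  # 1
total_correct = sum(
    pa == ra for pa, ra in zip(prediction_answers, reference_answers)
)  # 2

coverage = true_positive / ref_positive             # 1/2 = 0.50
groundedness = true_positive / pred_positive        # 1/2 = 0.50
accuracy = total_correct / len(prediction_answers)  # 2/4 = 0.50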

QagsScorerFactory

Bases: ScorerFactory

Scorer factory for QAGS.

Methods:

Name Description
__init__

Initialize the QAGS scorer factory.

create_scorer

Creates a QAGS scorer.

Source code in evalsense/evaluation/evaluators/qags.py
class QagsScorerFactory(ScorerFactory):
    """Scorer factory for QAGS."""

    def __init__(
        self,
        name: str,
        config: QagsConfig,
        metrics: list[Metric | dict[str, list[Metric]]]
        | dict[str, list[Metric]]
        | None = None,
    ):
        """
        Initialize the QAGS scorer factory.

        Args:
            name (str): The name of the QAGS scorer.
            config (QagsConfig): The configuration for the QAGS scorer.
            metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
                The metrics to use for the evaluation. If `None`, mean aggregation
                is used by default (applied to the QAGS Coverage, Groundedness and
                Accuracy scores in the "ternary" answer comparison mode, or to the
                overall QAGS score otherwise).
        """
        self.name = name
        self.config = config
        if metrics is None:
            if self.config.answer_comparison_mode == "ternary":
                metrics = [
                    {
                        f"{name} Coverage": [mean()],
                        f"{name} Groundedness": [mean()],
                        f"{name} Accuracy": [mean()],
                    }
                ]
            else:
                metrics = [mean()]
        self.metrics = metrics

    @override
    def create_scorer(self, model: Model) -> Scorer:
        """
        Creates a QAGS scorer.

        Args:
            model (Model): The model to create a scorer for.

        Returns:
            Scorer: The created QAGS scorer.
        """

        @scorer(name=self.name, metrics=self.metrics)
        def qags_scorer() -> Scorer:
            qags_score_calculator = QagsScoreCalculator(
                model=model,
                config=self.config,
                name=self.name,
            )

            async def score(state: TaskState, target: Target):
                return await qags_score_calculator.calculate_async(
                    input=state.input_text,
                    prediction=state.output.completion,
                    reference=target.text,
                    metadata=state.metadata,
                )

            return score

        return qags_scorer()

__init__

__init__(
    name: str,
    config: QagsConfig,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
)

Initialize the QAGS scorer factory.

Parameters:

Name Type Description Default
name str

The name of the QAGS scorer.

required
config QagsConfig

The configuration for the QAGS scorer.

required
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, mean aggregation is used by default (applied to the QAGS Coverage, Groundedness and Accuracy scores in the "ternary" answer comparison mode, or to the overall QAGS score otherwise).

None
Source code in evalsense/evaluation/evaluators/qags.py
def __init__(
    self,
    name: str,
    config: QagsConfig,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
):
    """
    Initialize the QAGS scorer factory.

    Args:
        name (str): The name of the QAGS scorer.
        config (QagsConfig): The configuration for the QAGS scorer.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, mean aggregation
            is used by default (applied to the QAGS Coverage, Groundedness and
            Accuracy scores in the "ternary" answer comparison mode, or to the
            overall QAGS score otherwise).
    """
    self.name = name
    self.config = config
    if metrics is None:
        if self.config.answer_comparison_mode == "ternary":
            metrics = [
                {
                    f"{name} Coverage": [mean()],
                    f"{name} Groundedness": [mean()],
                    f"{name} Accuracy": [mean()],
                }
            ]
        else:
            metrics = [mean()]
    self.metrics = metrics

create_scorer

create_scorer(model: Model) -> Scorer

Creates a QAGS scorer.

Parameters:

Name Type Description Default
model Model

The model to create a scorer for.

required

Returns:

Name Type Description
Scorer Scorer

The created QAGS scorer.

Source code in evalsense/evaluation/evaluators/qags.py
@override
def create_scorer(self, model: Model) -> Scorer:
    """
    Creates a QAGS scorer.

    Args:
        model (Model): The model to create a scorer for.

    Returns:
        Scorer: The created QAGS scorer.
    """

    @scorer(name=self.name, metrics=self.metrics)
    def qags_scorer() -> Scorer:
        qags_score_calculator = QagsScoreCalculator(
            model=model,
            config=self.config,
            name=self.name,
        )

        async def score(state: TaskState, target: Target):
            return await qags_score_calculator.calculate_async(
                input=state.input_text,
                prediction=state.output.completion,
                reference=target.text,
                metadata=state.metadata,
            )

        return score

    return qags_scorer()
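
A minimal usage sketch, assuming MyQagsConfig is a user-defined QagsConfig subclass (such as the prompt implementations sketched earlier on this page) and judge_model is an Inspect AI Model:

# Usage sketch; MyQagsConfig and judge_model are assumptions, not part of EvalSense.
config = MyQagsConfig()  # assumed to provide answer_comparison_mode, prompts, etc.
factory = QagsScorerFactory(name="QAGS", config=config)
qags_scorer = factory.create_scorer(judge_model)  # Inspect AI Scorer bound to the judge model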

RougeScoreCalculator

Bases: ScoreCalculator

Calculator for computing ROUGE scores.

Methods:

Name Description
calculate

Calculates ROUGE scores for the supplied model prediction and reference input.

calculate_async

Calculates ROUGE scores for the supplied model prediction and reference input.

Source code in evalsense/evaluation/evaluators/rouge.py
class RougeScoreCalculator(ScoreCalculator):
    """Calculator for computing ROUGE scores."""

    def __init__(self):
        self.rouge_module = evaluate.load("rouge")

    @override
    def calculate(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates ROUGE scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for ROUGE.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for ROUGE.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        if reference is None:
            raise ValueError("Reference is required for computing ROUGE, but was None.")

        predictions = [prediction]
        references = [reference]

        result = self.rouge_module.compute(
            predictions=predictions, references=references
        )
        return Score(
            value={
                "ROUGE-1": result["rouge1"],  # type: ignore
                "ROUGE-2": result["rouge2"],  # type: ignore
                "ROUGE-L": result["rougeL"],  # type: ignore
            },
            answer=prediction,
        )

    @override
    async def calculate_async(
        self,
        *,
        prediction: str,
        input: str | None = None,
        reference: str | None = None,
        metadata: dict[str, Any] | None = None,
        **kwargs: dict,
    ) -> Score:
        """
        Calculates ROUGE scores for the supplied model prediction and reference input.

        Args:
            prediction (str): The text of the prediction from the model.
            input (str, optional): The text of the input to the model. Ignored for ROUGE.
            reference (str, optional): The text of the reference input to compare against.
            metadata (dict[str, Any], optional): Additional metadata for the score.
                Ignored for ROUGE.

        Returns:
            Score: Inspect AI Score with the calculated evaluation results.
        """
        return self.calculate(
            prediction=prediction,
            input=input,
            reference=reference,
            metadata=metadata,
            **kwargs,
        )

calculate

calculate(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates ROUGE scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for ROUGE.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for ROUGE.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/rouge.py
@override
def calculate(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates ROUGE scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for ROUGE.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for ROUGE.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    if reference is None:
        raise ValueError("Reference is required for computing ROUGE, but was None.")

    predictions = [prediction]
    references = [reference]

    result = self.rouge_module.compute(
        predictions=predictions, references=references
    )
    return Score(
        value={
            "ROUGE-1": result["rouge1"],  # type: ignore
            "ROUGE-2": result["rouge2"],  # type: ignore
            "ROUGE-L": result["rougeL"],  # type: ignore
        },
        answer=prediction,
    )

calculate_async async

calculate_async(
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score

Calculates ROUGE scores for the supplied model prediction and reference input.

Parameters:

Name Type Description Default
prediction str

The text of the prediction from the model.

required
input str

The text of the input to the model. Ignored for ROUGE.

None
reference str

The text of the reference input to compare against.

None
metadata dict[str, Any]

Additional metadata for the score. Ignored for ROUGE.

None

Returns:

Name Type Description
Score Score

Inspect AI Score with the calculated evaluation results.

Source code in evalsense/evaluation/evaluators/rouge.py
@override
async def calculate_async(
    self,
    *,
    prediction: str,
    input: str | None = None,
    reference: str | None = None,
    metadata: dict[str, Any] | None = None,
    **kwargs: dict,
) -> Score:
    """
    Calculates ROUGE scores for the supplied model prediction and reference input.

    Args:
        prediction (str): The text of the prediction from the model.
        input (str, optional): The text of the input to the model. Ignored for ROUGE.
        reference (str, optional): The text of the reference input to compare against.
        metadata (dict[str, Any], optional): Additional metadata for the score.
            Ignored for ROUGE.

    Returns:
        Score: Inspect AI Score with the calculated evaluation results.
    """
    return self.calculate(
        prediction=prediction,
        input=input,
        reference=reference,
        metadata=metadata,
        **kwargs,
    )
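
A minimal usage sketch outside of an evaluation pipeline; the example texts are illustrative and the import path is assumed from the source location shown above:

# Usage sketch; the texts are illustrative.
# Assumed import: from evalsense.evaluation.evaluators.rouge import RougeScoreCalculator
calculator = RougeScoreCalculator()
score = calculator.calculate(
    prediction="The patient was discharged after three days.",
    reference="The patient was discharged after a three-day stay.",
)
print(score.value)  # {"ROUGE-1": ..., "ROUGE-2": ..., "ROUGE-L": ...}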

bleu_metric

bleu_metric() -> MetricProtocol

Base metric for BLEU scores.

Returns:

Name Type Description
MetricProtocol MetricProtocol

A function that computes BLEU scores.

Source code in evalsense/evaluation/evaluators/bleu.py
def bleu_metric() -> MetricProtocol:
    """
    Base metric for BLEU scores.

    Returns:
        MetricProtocol: A function that computes BLEU scores.
    """

    def metric(scores: list[SampleScore]) -> Value:
        bleu_module = evaluate.load("bleu")
        predictions = [score.score.metadata["prediction"] for score in scores]  # type: ignore
        references = [score.score.metadata["reference"] for score in scores]  # type: ignore
        result = bleu_module.compute(predictions=predictions, references=references)
        result = cast(dict[str, Any], result)
        return result["bleu"]

    return metric

get_bertscore_evaluator

get_bertscore_evaluator(
    *,
    name: str = "BERTScore",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    verbose: bool = False,
    idf: bool | dict[str, float] = False,
    device: str | None = None,
    batch_size: int = 64,
    nthreads: int = 1,
    rescale_with_baseline: bool = False,
    baseline_path: str | None = None,
    use_fast_tokenizer: bool = False,
) -> Evaluator

Returns a BERTScore evaluator.

Parameters:

Name Type Description Default
name str

The name of the evaluator and scorer. Defaults to "BERTScore".

'BERTScore'
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, the default metrics will be used (BERTScore Precision, Recall, and F1 with mean aggregation).

None
model_type str

The model type to use for computing BERTScore. Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing model according to BERTScore authors.

'microsoft/deberta-xlarge-mnli'
lang str

The language of the text. Defaults to "en".

'en'
num_layers int | None

The layer of representations to use. The default is the number of layers tuned on WMT16 correlation data, which depends on the model_type used.

None
verbose bool

Whether to turn on verbose mode. Defaults to False.

False
idf bool | dict

Use IDF weighting — can be a precomputed IDF dictionary. Defaults to False (no IDF weighting).

False
device str | None

The device to use for computing the contextual embeddings. If this argument is not set or None, the model will be loaded on cuda:0 if available.

None
nthreads int

The number of threads to use for computing the contextual embeddings. Defaults to 1.

1
batch_size int

The batch size to use for computing the contextual embeddings. Defaults to 64.

64
rescale_with_baseline bool

Whether to rescale the BERTScore with pre-computed baseline. The default value is False.

False
baseline_path str | None

Customized baseline file.

None
use_fast_tokenizer bool

The use_fast parameter passed to HF tokenizer. Defaults to False.

False

Returns:

Name Type Description
Evaluator Evaluator

The BERTScore evaluator.

Source code in evalsense/evaluation/evaluators/bertscore.py
def get_bertscore_evaluator(
    *,
    name: str = "BERTScore",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_type: str = "microsoft/deberta-xlarge-mnli",
    lang: str = "en",
    num_layers: int | None = None,
    verbose: bool = False,
    idf: bool | dict[str, float] = False,
    device: str | None = None,
    batch_size: int = 64,
    nthreads: int = 1,
    rescale_with_baseline: bool = False,
    baseline_path: str | None = None,
    use_fast_tokenizer: bool = False,
) -> Evaluator:
    """
    Returns a BERTScore evaluator.

    Args:
        name (str): The name of the evaluator and scorer. Defaults to "BERTScore".
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metrics
            will be used (BERTScore Precision, Recall, and F1 with mean aggregation).
        model_type (str, optional): The model type to use for computing BERTScore.
            Defaults to "microsoft/deberta-xlarge-mnli", the currently best-performing
            model according to BERTScore authors.
        lang (str, optional): The language of the text. Defaults to "en".
        num_layers (int | None, optional): The layer of representations to use. The
            default is the number of layers tuned on WMT16 correlation data, which
            depends on the `model_type` used.
        verbose (bool, optional): Whether to turn on verbose mode. Defaults to `False`.
        idf (bool | dict, optional): Use IDF weighting — can be a precomputed IDF dictionary.
            Defaults to `False` (no IDF weighting).
        device (str | None, optional): The device to use for computing the contextual
            embeddings. If this argument is not set or `None`, the model will be
            loaded on `cuda:0` if available.
        nthreads (int, optional): The number of threads to use for computing the
            contextual embeddings. Defaults to `1`.
        batch_size (int, optional): The batch size to use for computing the
            contextual embeddings. Defaults to `64`.
        rescale_with_baseline (bool, optional): Whether to rescale the BERTScore with
            pre-computed baseline. The default value is `False`.
        baseline_path (str | None, optional): Customized baseline file.
        use_fast_tokenizer (bool, optional): The `use_fast` parameter passed to HF
            tokenizer. Defaults to `False`.

    Returns:
        Evaluator: The BERTScore evaluator.
    """
    if metrics is None:
        metrics = [
            {"BERTScore Precision": [mean()]},
            {"BERTScore Recall": [mean()]},
            {"BERTScore F1": [mean()]},
        ]

    calculator = BertScoreCalculator(
        model_type=model_type,
        lang=lang,
        num_layers=num_layers,
        idf=idf,
    )

    async def init_bertscore() -> None:
        async with concurrency("init_bertscore", 1):
            if not hasattr(calculator, "bertscore_module"):
                setattr(
                    calculator,
                    "bertscore_module",
                    evaluate.load("bertscore"),
                )

    def cleanup_bertscore() -> None:
        import torch

        del calculator.bertscore_module
        gc.collect()
        torch.cuda.empty_cache()

    @scorer(
        name=name,
        metrics=metrics,
    )
    def bertscore_scorer() -> Scorer:
        async def score(state: TaskState, target: Target) -> Score:
            await init_bertscore()

            return await calculator.calculate_async(
                prediction=state.output.completion,
                reference=target.text,
                verbose=verbose,
                device=device,
                batch_size=batch_size,
                nthreads=nthreads,
                rescale_with_baseline=rescale_with_baseline,
                baseline_path=baseline_path,
                use_fast_tokenizer=use_fast_tokenizer,
            )

        return score

    return Evaluator(
        name=name,
        scorer=bertscore_scorer(),
        cleanup_fun=cleanup_bertscore,
    )
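
A usage sketch constructing the evaluator with a few non-default options; the argument values are illustrative:

# Usage sketch; argument values are illustrative.
bertscore_evaluator = get_bertscore_evaluator(
    model_type="microsoft/deberta-xlarge-mnli",
    lang="en",
    batch_size=32,
    rescale_with_baseline=False,
)
# The returned Evaluator can then be used in an EvalSense evaluation pipeline.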

get_bleu_evaluator

get_bleu_evaluator(
    name: str = "BLEU",
    scorer_name: str = "BLEU Precision",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator

Returns an evaluator for BLEU scores.

Parameters:

Name Type Description Default
name str

The name of the metric and evaluator. Defaults to "BLEU".

'BLEU'
scorer_name str

The name of the internal scorer. Defaults to "BLEU Precision".

'BLEU Precision'
metrics list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None

The metrics to use for the evaluation. If None, the default metric will be used (BLEU).

None

Returns:

Name Type Description
Evaluator Evaluator

An evaluator for BLEU scores.

Source code in evalsense/evaluation/evaluators/bleu.py
def get_bleu_evaluator(
    name: str = "BLEU",
    scorer_name: str = "BLEU Precision",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator:
    """
    Returns an evaluator for BLEU scores.

    Args:
        name (str): The name of the metric and evaluator. Defaults to "BLEU".
        scorer_name (str): The name of the internal scorer. Defaults to "BLEU Precision".
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metric
            will be used (BLEU).

    Returns:
        Evaluator: An evaluator for BLEU scores.
    """

    @metric(name=name)
    def bleu() -> MetricProtocol:
        return bleu_metric()

    if metrics is None:
        metrics = [bleu()]

    bleu_calculator = BleuPrecisionScoreCalculator()

    @scorer(name=scorer_name, metrics=metrics)
    def bleu_precision_scorer() -> Scorer:
        async def score(state: TaskState, target: Target):
            return await bleu_calculator.calculate_async(
                prediction=state.output.completion, reference=target.text
            )

        return score

    return Evaluator(name, scorer=bleu_precision_scorer())
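
A usage sketch with default settings and with a custom metric name:

# Usage sketch.
bleu_evaluator = get_bleu_evaluator()
custom_bleu_evaluator = get_bleu_evaluator(name="Corpus BLEU", scorer_name="BLEU Precision")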

get_g_eval_evaluator

get_g_eval_evaluator(
    *,
    name: str = "G-Eval",
    quality_name: str = "Unknown",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    prompt_template: str,
    model_config: ModelConfig,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
) -> Evaluator

Constructs a G-Eval evaluator that can be used in the EvalSense evaluation pipeline.

Parameters:

    name (str): The name of the evaluator. Defaults to "G-Eval".
    quality_name (str): The name of the quality to be evaluated by G-Eval.
        Defaults to "Unknown".
    model_name (str | None): The name of the model to be used for evaluation.
        If None, the model name will be taken from the model config. Defaults to None.
    metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
        The metrics to use for the evaluation. If None, the default metric will be
        used (G-Eval). Defaults to None.
    prompt_template (str): The prompt template to use. The supplied template should
        be a format string with {prediction} and (optionally) {reference} as
        placeholders, as well as any additional placeholders for entries in
        Inspect AI sample/task state metadata. The template should instruct the
        judge model to respond with a numerical score between the specified
        min_score and max_score. Required.
    model_config (ModelConfig): The model configuration. Required.
    logprobs (bool): Whether to use model log probabilities to compute a weighted
        evaluation score instead of a standard score. Defaults to True.
    top_logprobs (int): The number of top log probabilities to consider. Defaults to 20.
    min_score (int): The minimum valid score. Defaults to 1.
    max_score (int): The maximum valid score. Defaults to 10.
    normalise (bool): Whether to normalise the scores between 0 and 1. Defaults to True.
    debug (bool): Whether to report repeated errors in the log. Defaults to False.

Returns:

    Evaluator: The constructed G-Eval evaluator.

Source code in evalsense/evaluation/evaluators/g_eval.py
def get_g_eval_evaluator(
    *,
    name: str = "G-Eval",
    quality_name: str = "Unknown",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    prompt_template: str,
    model_config: ModelConfig,
    logprobs: bool = True,
    top_logprobs: int = 20,
    min_score: int = 1,
    max_score: int = 10,
    normalise: bool = True,
    debug: bool = False,
) -> Evaluator:
    """
    Constructs a G-Eval evaluator that can be used in the EvalSense evaluation pipeline.

    Args:
        name (str): The name of the evaluator. Defaults to "G-Eval".
        quality_name (str): The name of the quality to be evaluated by G-Eval.
        model_name (str | None): The name of the model to be used for evaluation.
            If `None`, the model name will be taken from the model config.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metric
            will be used (G-Eval).
        prompt_template (str): The prompt template to use. The supplied template should
            be a format string with {prediction} and (optionally) {reference} as
            placeholders, as well as any additional placeholders for entries in
            Inspect AI sample/task state metadata. The template should instruct the
            judge model to respond with a numerical score between the specified
            min_score and max_score.
        model_config (ModelConfig): The model configuration.
        logprobs (bool): Whether to use model log probabilities to compute weighted
            evaluation score instead of a standard score.
        top_logprobs (int): The number of top log probabilities to consider.
        min_score (int): The minimum valid score.
        max_score (int): The maximum valid score.
        normalise (bool): Whether to normalise the scores between 0 and 1.
        debug (bool): Whether to report repeated errors in the log.

    Returns:
        Evaluator: The constructed G-Eval evaluator.
    """
    metric_name = f"{name} ({quality_name}, {model_name or model_config.name})"
    return Evaluator(
        name=metric_name,
        scorer=GEvalScorerFactory(
            name=metric_name,
            metrics=metrics,
            prompt_template=prompt_template,
            logprobs=logprobs,
            top_logprobs=top_logprobs,
            min_score=min_score,
            max_score=max_score,
            normalise=normalise,
            debug=debug,
        ),
        model_config=model_config,
    )
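
Example usage (a minimal sketch: the template wording is illustrative only, the import path is an assumption, and judge_model_config stands for a ModelConfig constructed elsewhere):

from evalsense.evaluation.evaluators import get_g_eval_evaluator  # import path assumed

# Illustrative template: any wording works as long as it uses {prediction}
# (and optionally {reference}) and asks the judge for a single integer
# between min_score and max_score.
COHERENCE_TEMPLATE = (
    "You will be shown a model-generated summary and a reference summary.\n"
    "Generated summary: {prediction}\n"
    "Reference summary: {reference}\n"
    "Rate the coherence of the generated summary on a scale from 1 to 10.\n"
    "Respond with a single integer and nothing else."
)

g_eval_coherence = get_g_eval_evaluator(
    quality_name="Coherence",
    prompt_template=COHERENCE_TEMPLATE,
    model_config=judge_model_config,  # hypothetical ModelConfig defined elsewhere
    min_score=1,
    max_score=10,
)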

get_qags_evaluator

get_qags_evaluator(
    *,
    config: QagsConfig,
    name: str = "QAGS",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_config: ModelConfig,
) -> Evaluator

Constructs a QAGS evaluator that can be used in the EvalSense evaluation pipeline.

Parameters:

    config (QagsConfig): The configuration for the QAGS evaluator. Required.
    name (str): The name of the QAGS evaluator. Defaults to "QAGS".
    model_name (str | None): The name of the model to use for evaluation.
        If None, the name from the model configuration will be used. Defaults to None.
    metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
        The metrics to use for the evaluation. If None, the default metrics will be
        used (QAGS precision, recall and F1). Defaults to None.
    model_config (ModelConfig): The configuration of the model to be used
        for evaluation. Required.

Returns:

    Evaluator: The constructed QAGS evaluator.

Source code in evalsense/evaluation/evaluators/qags.py
def get_qags_evaluator(
    *,
    config: QagsConfig,
    name: str = "QAGS",
    model_name: str | None = None,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
    model_config: ModelConfig,
) -> Evaluator:
    """
    Constructs a QAGS evaluator that can be used in the EvalSense evaluation pipeline.

    Args:
        config (QagsConfig): The configuration for the QAGS evaluator.
        name (str): The name of the QAGS evaluator.
        model_name (str | None): The name of the model to use for evaluation.
            If `None`, the name from the model configuration will be used.
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for the evaluation. If `None`, the default metrics
            will be used (QAGS precision, recall and F1).
        model_config (ModelConfig): The configuration of the model to be used
            for evaluation.

    Returns:
        Evaluator: The constructed QAGS evaluator.
    """
    metric_name = f"{name} ({model_name or model_config.name})"
    return Evaluator(
        name=metric_name,
        scorer=QagsScorerFactory(
            name=metric_name,
            config=config,
            metrics=metrics,
        ),
        model_config=model_config,
    )
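
Example usage (a minimal sketch: my_qags_config stands for an object implementing the QagsConfig protocol and qa_model_config for a ModelConfig, both assumed to be defined elsewhere; the import path is also an assumption):

from evalsense.evaluation.evaluators import get_qags_evaluator  # import path assumed

qags_evaluator = get_qags_evaluator(
    config=my_qags_config,         # hypothetical QagsConfig implementation
    model_config=qa_model_config,  # hypothetical ModelConfig for the evaluation model
)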

get_rouge_evaluator

get_rouge_evaluator(
    name: str = "ROUGE",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator

Returns an evaluator for ROUGE scores.

Parameters:

    name (str): The name of the evaluator. Defaults to "ROUGE".
    metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
        The metrics to use for evaluation. If None, defaults to ROUGE-1, ROUGE-2,
        and ROUGE-L with mean aggregation.

Returns:

    Evaluator: An evaluator for ROUGE scores.

Source code in evalsense/evaluation/evaluators/rouge.py
def get_rouge_evaluator(
    name: str = "ROUGE",
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None = None,
) -> Evaluator:
    """
    Returns an evaluator for ROUGE scores.

    Args:
        name (str): The name of the evaluator. Defaults to "ROUGE".
        metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None):
            The metrics to use for evaluation. If None, defaults to ROUGE-1, ROUGE-2,
            and ROUGE-L with mean aggregation.

    Returns:
        Evaluator: An evaluator for ROUGE scores.
    """
    if metrics is None:
        metrics = [
            {
                "ROUGE-1": [mean()],
                "ROUGE-2": [mean()],
                "ROUGE-L": [mean()],
            }
        ]

    rouge_calculator = RougeScoreCalculator()

    @scorer(name=name, metrics=metrics)
    def rouge_scorer() -> Scorer:
        async def score(state: TaskState, target: Target) -> Score:
            return await rouge_calculator.calculate_async(
                prediction=state.output.completion, reference=target.text
            )

        return score

    return Evaluator(name, scorer=rouge_scorer())
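
Example usage (a minimal sketch; the import paths for get_rouge_evaluator and mean are assumptions based on the code above):

from evalsense.evaluation.evaluators import get_rouge_evaluator  # import path assumed
from inspect_ai.scorer import mean  # assumed source of mean(), as used above

# Default evaluator reporting ROUGE-1, ROUGE-2 and ROUGE-L with mean aggregation.
rouge_evaluator = get_rouge_evaluator()

# Restricting aggregation to ROUGE-L only.
rouge_l_evaluator = get_rouge_evaluator(
    name="ROUGE-L",
    metrics=[{"ROUGE-L": [mean()]}],
)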