Evaluation Method Catalogue
ROUGE-L
Measures overlap between the output and the reference based on the length of their longest common subsequence (LCS), usually reported as an F-measure over LCS precision and recall.
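A minimal Python sketch of the LCS-based F-measure, assuming whitespace tokenization; library implementations (e.g. the rouge-score package) add stemming, tokenization rules, and multi-reference handling:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```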
BLEU
Measures modified n-gram precision between the output and the reference, combined with a brevity penalty that discourages overly short outputs.
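A sentence-level example using NLTK (assumes the nltk package is installed); corpus-level BLEU with smoothing is usually preferred in practice:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat is on the mat".split()]   # list of tokenized references
candidate = "the cat sat on the mat".split()

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```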
BERTScore
Measures semantic similarity between tokens in candidate and reference texts using contextual embeddings from BERT.
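A short example using the bert-score package (pip install bert-score), assuming the score(cands, refs, lang=...) entry point documented in its README:

```python
from bert_score import score

candidates = ["the cat sat on the mat"]
references = ["the cat is on the mat"]

# Returns per-pair precision, recall, and F1 as tensors.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```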
MEDCON
A domain-specific metric for healthcare text. Computes an F1 score from the overlap of Unified Medical Language System (UMLS) concepts between candidate and reference texts.
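A hedged sketch of the concept-overlap F1 at MEDCON's core. extract_umls_concepts is a hypothetical stand-in for a UMLS concept extractor (e.g. QuickUMLS or scispaCy's UMLS linker); the set-F1 arithmetic is the part shown here:

```python
def concept_f1(candidate_concepts: set, reference_concepts: set) -> float:
    """F1 over sets of UMLS concept identifiers (CUIs)."""
    if not candidate_concepts or not reference_concepts:
        return 0.0
    true_pos = len(candidate_concepts & reference_concepts)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(candidate_concepts)
    recall = true_pos / len(reference_concepts)
    return 2 * precision * recall / (precision + recall)

# Usage with the hypothetical extractor:
# cand_cuis = extract_umls_concepts(candidate_note)   # e.g. {"C0020538", ...}
# ref_cuis  = extract_umls_concepts(reference_note)
# print(concept_f1(cand_cuis, ref_cuis))
```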
G-Eval
Leverages large language models (LLMs) as judges to evaluate text quality based on user-defined criteria and scoring rubrics provided in a prompt.
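A minimal prompt-and-parse sketch of the LLM-as-judge pattern. call_llm is a hypothetical stand-in for any chat-completion client, and the rubric text is illustrative; the published G-Eval recipe additionally weights scores by token probabilities:

```python
RUBRIC = """You are grading a summary for coherence on a 1-5 scale.
1 = incoherent, 5 = flows logically with no contradictions.
Source:
{source}
Summary:
{summary}
Respond with a single integer from 1 to 5."""

def g_eval_coherence(source: str, summary: str, call_llm) -> int:
    prompt = RUBRIC.format(source=source, summary=summary)
    reply = call_llm(prompt)     # returns the model's text reply
    return int(reply.strip())    # real setups parse and average more robustly
```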
QAGS
Evaluates factual consistency by generating questions from the model output, answering each question against both the source text and the output, and comparing the answers; question generation and answering can be performed with an LLM.
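A hedged sketch of the pipeline. generate_questions and answer_question are hypothetical stand-ins for QG/QA models or LLM prompts, and exact-match answer comparison is a simplification (the original QAGS uses token-level F1 between answers):

```python
def qags_score(source: str, summary: str,
               generate_questions, answer_question) -> float:
    """Fraction of summary-derived questions answered consistently
    from the source and from the summary."""
    questions = generate_questions(summary)
    if not questions:
        return 0.0
    consistent = 0
    for q in questions:
        ans_from_source = answer_question(q, source)
        ans_from_summary = answer_question(q, summary)
        if ans_from_source.strip().lower() == ans_from_summary.strip().lower():
            consistent += 1
    return consistent / len(questions)
```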
Stratified Performance Evaluation
Evaluates model performance across different subgroups of data to identify potential biases or performance disparities.
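A minimal sketch computing per-subgroup accuracy; the group labels and records are illustrative:

```python
from collections import defaultdict

def stratified_accuracy(records):
    """records: iterable of (group, y_true, y_pred) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}

records = [("A", 1, 1), ("A", 0, 1), ("B", 1, 1), ("B", 1, 1)]
print(stratified_accuracy(records))  # {'A': 0.5, 'B': 1.0}
```

A large gap between subgroup scores flags a potential disparity worth investigating.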
Word Count
Measures the total number of words in the generated text. A basic indicator of output conciseness.
Compression Ratio
Measures the ratio of the length of the generated text (e.g., summary) to the length of the source or reference text.
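Both length metrics (word count above and compression ratio) in a few lines, assuming whitespace tokenization:

```python
def word_count(text: str) -> int:
    return len(text.split())

def compression_ratio(summary: str, source: str) -> float:
    return word_count(summary) / word_count(source)

source = "the quick brown fox jumps over the lazy dog near the river bank"
summary = "a fox jumps over a dog"
print(word_count(summary))                           # 6
print(f"{compression_ratio(summary, source):.2f}")   # 0.46
```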
Accuracy
Measures the proportion of correct predictions out of all predictions made in a classification task.
Precision
Measures the proportion of true positive predictions among all positive predictions made by the model.
Recall
Measures the proportion of actual positive instances that were correctly identified by the model.
F1-Score
The harmonic mean of precision and recall, providing a single score that balances both concerns.
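The four classification metrics above on a toy binary task, using scikit-learn (assumes the package is installed):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 4/6 = 0.67
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/4 = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/4 = 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```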
OOD Performance Evaluation
Assesses model performance on out-of-distribution (OOD) data that differs significantly from the training data distribution.
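A hedged sketch of the comparison at the heart of OOD evaluation. evaluate_fn and the two datasets are hypothetical stand-ins for whatever evaluation harness is in use:

```python
def ood_gap(model, id_data, ood_data, evaluate_fn):
    """Return (in-distribution score, OOD score, absolute drop)."""
    id_score = evaluate_fn(model, id_data)
    ood_score = evaluate_fn(model, ood_data)
    return id_score, ood_score, id_score - ood_score
```

A large drop suggests the model relies on patterns specific to the training distribution.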
Adversarial Robustness Evaluation
Assesses model resilience against small, intentionally crafted perturbations to inputs designed to cause mispredictions or undesirable behavior.
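One classic way to craft such perturbations is the fast gradient sign method (FGSM), sketched here in PyTorch under the assumptions of image-like inputs scaled to [0, 1], a classification model, and an illustrative epsilon:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Perturb inputs x by epsilon in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()   # keep inputs in the valid range

def robust_accuracy(model, x, y, epsilon=0.03):
    """Accuracy on adversarially perturbed inputs."""
    x_adv = fgsm_attack(model, x, y, epsilon)
    preds = model(x_adv).argmax(dim=1)
    return (preds == y).float().mean().item()
```

The gap between clean accuracy and robust_accuracy is a simple measure of adversarial resilience.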