Evaluation Method Catalogue
ROUGE-L
Measures overlap between the output and the reference based on the length of their longest common subsequence (LCS), usually reported as an F-measure over LCS precision and recall.
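A minimal Python sketch of the LCS-based F-measure, assuming whitespace tokenization; library implementations (e.g. the rouge-score package) add stemming, tokenization rules, and multi-reference handling:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```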
BLEU
Measures modified n-gram precision between the output and the reference, combined with a brevity penalty that discourages overly short outputs.
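A sentence-level example using NLTK (assumes the nltk package is installed); corpus-level BLEU with smoothing is usually preferred in practice:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat is on the mat".split()]   # list of tokenized references
candidate = "the cat sat on the mat".split()

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```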
BERTScore
Measures semantic similarity between tokens in candidate and reference texts using contextual embeddings from BERT.
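A short example using the bert-score package (pip install bert-score), assuming the score(cands, refs, lang=...) entry point documented in its README:

```python
from bert_score import score

candidates = ["the cat sat on the mat"]
references = ["the cat is on the mat"]

# Returns per-pair precision, recall, and F1 as tensors.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```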
MEDCON
A domain-specific metric for healthcare text. Computes an F1 score from the overlap of Unified Medical Language System (UMLS) concepts between candidate and reference texts.
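A hedged sketch of the concept-overlap F1 at MEDCON's core. extract_umls_concepts is a hypothetical stand-in for a UMLS concept extractor (e.g. QuickUMLS or scispaCy's UMLS linker); the set-F1 arithmetic is the part shown here:

```python
def concept_f1(candidate_concepts: set, reference_concepts: set) -> float:
    """F1 over sets of UMLS concept identifiers (CUIs)."""
    if not candidate_concepts or not reference_concepts:
        return 0.0
    true_pos = len(candidate_concepts & reference_concepts)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(candidate_concepts)
    recall = true_pos / len(reference_concepts)
    return 2 * precision * recall / (precision + recall)

# Usage with the hypothetical extractor:
# cand_cuis = extract_umls_concepts(candidate_note)   # e.g. {"C0020538", ...}
# ref_cuis  = extract_umls_concepts(reference_note)
# print(concept_f1(cand_cuis, ref_cuis))
```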
G-Eval
Leverages large language models (LLMs) as judges to evaluate text quality based on user-defined criteria and scoring rubrics provided in a prompt.
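A minimal prompt-and-parse sketch of the LLM-as-judge pattern. call_llm is a hypothetical stand-in for any chat-completion client, and the rubric text is illustrative; the published G-Eval recipe additionally weights scores by token probabilities:

```python
RUBRIC = """You are grading a summary for coherence on a 1-5 scale.
1 = incoherent, 5 = flows logically with no contradictions.
Source:
{source}
Summary:
{summary}
Respond with a single integer from 1 to 5."""

def g_eval_coherence(source: str, summary: str, call_llm) -> int:
    prompt = RUBRIC.format(source=source, summary=summary)
    reply = call_llm(prompt)     # returns the model's text reply
    return int(reply.strip())    # real setups parse and average more robustly
```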
QAGS
Evaluates factual consistency by generating questions from the model output, answering each question against both the source text and the output, and comparing the answers; question generation and answering can be performed with an LLM.
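A hedged sketch of the pipeline. generate_questions and answer_question are hypothetical stand-ins for QG/QA models or LLM prompts, and exact-match answer comparison is a simplification (the original QAGS uses token-level F1 between answers):

```python
def qags_score(source: str, summary: str,
               generate_questions, answer_question) -> float:
    """Fraction of summary-derived questions answered consistently
    from the source and from the summary."""
    questions = generate_questions(summary)
    if not questions:
        return 0.0
    consistent = 0
    for q in questions:
        ans_from_source = answer_question(q, source)
        ans_from_summary = answer_question(q, summary)
        if ans_from_source.strip().lower() == ans_from_summary.strip().lower():
            consistent += 1
    return consistent / len(questions)
```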
Stratified Performance Evaluation
Evaluates model performance across different subgroups of data to identify potential biases or performance disparities.
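A minimal sketch computing per-subgroup accuracy; the group labels and records are illustrative:

```python
from collections import defaultdict

def stratified_accuracy(records):
    """records: iterable of (group, y_true, y_pred) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}

records = [("A", 1, 1), ("A", 0, 1), ("B", 1, 1), ("B", 1, 1)]
print(stratified_accuracy(records))  # {'A': 0.5, 'B': 1.0}
```

A large gap between subgroup scores flags a potential disparity worth investigating.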
Word Count
Measures the total number of words in the generated text. A basic indicator of output conciseness.
Compression Ratio
Measures the ratio of the length of the generated text (e.g., summary) to the length of the source or reference text.
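Both length metrics (word count above and compression ratio) in a few lines, assuming whitespace tokenization:

```python
def word_count(text: str) -> int:
    return len(text.split())

def compression_ratio(summary: str, source: str) -> float:
    return word_count(summary) / word_count(source)

source = "the quick brown fox jumps over the lazy dog near the river bank"
summary = "a fox jumps over a dog"
print(word_count(summary))                           # 6
print(f"{compression_ratio(summary, source):.2f}")   # 0.46
```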
Accuracy
Measures the proportion of correct predictions out of all predictions made in a classification task.
Precision
Measures the proportion of true positive predictions among all positive predictions made by the model.
Recall
Measures the proportion of actual positive instances that were correctly identified by the model.
F1-Score
The harmonic mean of precision and recall, providing a single score that balances both concerns.
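The four classification metrics above on a toy binary task, using scikit-learn (assumes the package is installed):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 4/6 = 0.67
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/4 = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/4 = 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```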
OOD Performance Evaluation
Assesses model performance on out-of-distribution (OOD) data that differs significantly from the training data distribution.
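A hedged sketch of the comparison at the heart of OOD evaluation. evaluate_fn and the two datasets are hypothetical stand-ins for whatever evaluation harness is in use:

```python
def ood_gap(model, id_data, ood_data, evaluate_fn):
    """Return (in-distribution score, OOD score, absolute drop)."""
    id_score = evaluate_fn(model, id_data)
    ood_score = evaluate_fn(model, ood_data)
    return id_score, ood_score, id_score - ood_score
```

A large drop suggests the model relies on patterns specific to the training distribution.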
Adversarial Robustness Evaluation
Assesses model resilience against small, intentionally crafted perturbations to inputs designed to cause mispredictions or undesirable behavior.
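One classic way to craft such perturbations is the fast gradient sign method (FGSM), sketched here in PyTorch under the assumptions of image-like inputs scaled to [0, 1], a classification model, and an illustrative epsilon:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Perturb inputs x by epsilon in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()   # keep inputs in the valid range

def robust_accuracy(model, x, y, epsilon=0.03):
    """Accuracy on adversarially perturbed inputs."""
    x_adv = fgsm_attack(model, x, y, epsilon)
    preds = model(x_adv).argmax(dim=1)
    return (preds == y).float().mean().item()
```

The gap between clean accuracy and robust_accuracy is a simple measure of adversarial resilience.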