Measures overlap based on the length of the longest common subsequence (LCS) between the output and the reference.
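A minimal sketch of how an LCS-based overlap score can be computed, assuming whitespace tokenization; the function names here are illustrative, not a specific library's API.

```python
# LCS-based overlap score (ROUGE-L-style F-measure), assuming whitespace tokenization.
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming LCS over token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```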
Measures modified n-gram precision of the output against the reference, combined with a brevity penalty that penalizes overly short outputs.
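One common way to compute this in practice is NLTK's `sentence_bleu`, which implements the modified n-gram precisions and the brevity penalty; the snippet below is a sketch that assumes whitespace tokenization is adequate.

```python
# Sketch using NLTK's sentence-level BLEU with light smoothing.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the patient was discharged in stable condition".split()
candidate = "the patient was discharged in good condition".split()

# sentence_bleu takes a list of reference token lists and one candidate token list.
score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```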
Measures semantic similarity between tokens in candidate and reference texts using contextual embeddings from BERT.
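A sketch using the `bert-score` package (one widely used implementation), assuming it is installed; the example texts are illustrative.

```python
# pip install bert-score; the underlying model is downloaded on first call.
from bert_score import score

candidates = ["the patient was discharged in stable condition"]
references = ["the patient left the hospital in a stable state"]

# Returns per-example precision, recall, and F1 based on token-level
# cosine similarity of contextual embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```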
Domain-specific metric for healthcare. Computes F1 score based on overlap of Unified Medical Language System (UMLS) concepts between candidate and reference texts.
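A sketch of the concept-overlap F1, where `extract_umls_concepts` is a hypothetical helper; in practice it would wrap a concept extractor such as QuickUMLS or scispaCy's UMLS linker and return a set of Concept Unique Identifiers (CUIs).

```python
# UMLS-concept-overlap F1; the extractor is passed in and is an assumption here.
def umls_f1(candidate: str, reference: str, extract_umls_concepts) -> float:
    cand_cuis = set(extract_umls_concepts(candidate))
    ref_cuis = set(extract_umls_concepts(reference))
    if not cand_cuis or not ref_cuis:
        return 0.0
    overlap = len(cand_cuis & ref_cuis)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_cuis)
    recall = overlap / len(ref_cuis)
    return 2 * precision * recall / (precision + recall)
```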
Leverages large language models (LLMs) as judges to evaluate text quality based on user-defined criteria and scoring rubrics provided in a prompt.
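A minimal sketch of an LLM-as-judge call; `call_llm` is a hypothetical stand-in for whichever chat-completion client is actually used, and the rubric and 1-to-5 scale are illustrative.

```python
# LLM-as-judge sketch: the rubric, scale, and call_llm helper are assumptions.
JUDGE_PROMPT = """You are grading a generated text against a reference.
Criteria: factual accuracy, completeness, and clarity.
Score the candidate from 1 (poor) to 5 (excellent) and answer with the number only.

Reference:
{reference}

Candidate:
{candidate}
"""

def llm_judge_score(candidate: str, reference: str, call_llm) -> int:
    prompt = JUDGE_PROMPT.format(reference=reference, candidate=candidate)
    return int(call_llm(prompt).strip())
```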
Evaluates factual consistency by using an LLM to generate questions from a background text and the model output, then comparing the answers obtained from each.
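A sketch of one question-answer-generation (QAG) variant; `call_llm` is again a hypothetical LLM client and the prompts are illustrative, not a fixed protocol.

```python
# QAG-style consistency sketch: generate questions, answer against both texts, compare.
def qag_consistency(background: str, output: str, call_llm, n_questions: int = 5) -> float:
    questions = [
        q for q in call_llm(
            f"Write {n_questions} short factual questions answerable from this text, one per line:\n{background}"
        ).splitlines()
        if q.strip()
    ]
    consistent = 0
    for q in questions:
        ans_src = call_llm(f"Answer briefly using only this text:\n{background}\n\nQuestion: {q}")
        ans_out = call_llm(f"Answer briefly using only this text:\n{output}\n\nQuestion: {q}")
        verdict = call_llm(f"Do these two answers agree? Reply YES or NO.\nA: {ans_src}\nB: {ans_out}")
        consistent += verdict.strip().upper().startswith("YES")
    return consistent / max(len(questions), 1)
```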
Evaluates model performance across different subgroups of data to identify potential biases or performance disparities.
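A sketch of per-subgroup evaluation with pandas and scikit-learn; the column names and the choice of accuracy as the per-group metric are illustrative.

```python
# Per-subgroup evaluation: report the same metric for each subgroup to surface disparities.
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "subgroup": ["A", "A", "B", "B", "B"],
    "label":    [1, 0, 1, 1, 0],
    "pred":     [1, 0, 0, 1, 1],
})

per_group = df.groupby("subgroup")[["label", "pred"]].apply(
    lambda g: accuracy_score(g["label"], g["pred"])
)
print(per_group)
```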
Measures the total number of words in the generated text. A basic indicator of output conciseness.
Measures the ratio of the length of the generated text (e.g., summary) to the length of the source or reference text.
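Both length-based checks above reduce to a few lines; a sketch assuming whitespace tokenization.

```python
# Word count and length (compression) ratio, assuming whitespace tokenization.
def word_count(text: str) -> int:
    return len(text.split())

def compression_ratio(summary: str, source: str) -> float:
    # Ratio of generated length to source length; lower values mean more compression.
    return word_count(summary) / max(word_count(source), 1)
```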
Measures the proportion of correct predictions out of all predictions made in a classification task.
Measures the proportion of true positive predictions among all positive predictions made by the model.
Measures the proportion of actual positive instances that were correctly identified by the model.
The harmonic mean of precision and recall, providing a single score that balances both concerns.
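Accuracy, precision, recall, and F1 are all available in scikit-learn; a minimal sketch with illustrative labels.

```python
# Standard classification metrics on a toy binary example.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```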
Assesses model performance on out-of-distribution (OOD) data that differs significantly from the training data distribution.
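A minimal sketch of one way to report this: compute the same metric on an in-distribution split and an OOD split and look at the gap. `evaluate` is a hypothetical helper returning a scalar metric for a dataset.

```python
# OOD check sketch: the gap between in-distribution and OOD scores.
def ood_gap(model, in_dist_data, ood_data, evaluate) -> float:
    in_score = evaluate(model, in_dist_data)
    ood_score = evaluate(model, ood_data)
    return in_score - ood_score  # larger gap suggests less robustness to distribution shift
```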
Assesses model resilience against small, intentionally crafted perturbations to inputs designed to cause mispredictions or undesirable behavior.
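A sketch of a very simple robustness probe: apply small character-level perturbations and measure how often predictions flip. `predict` is a hypothetical single-example classifier, and real adversarial evaluation would use stronger gradient- or search-based attacks rather than random character swaps.

```python
# Crude robustness probe: random adjacent-character swaps as a stand-in for adversarial edits.
import random

def perturb(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    chars = list(text)
    if len(chars) < 2:
        return text
    rng = random.Random(seed)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def flip_rate(texts: list[str], predict) -> float:
    # Fraction of inputs whose prediction changes under perturbation.
    flips = sum(predict(t) != predict(perturb(t)) for t in texts)
    return flips / max(len(texts), 1)
```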