6. Metrics to assess the quality of generated outputs

This section discusses particular quantitative metrics we could use. It would be useful to evaluate on domain-specific tasks as well as on general LLM tasks.

6.1 Topics

  • Coherence: Does the output make sense?
  • Relevance: Is the output relevant to the prompt?
  • Fluency: Is the output grammatically correct?
  • Context understanding: Does the output make appropriate use of the context provided in the prompt?
  • Diversity: How much does the output style vary?

6.2 Metrics

HF Evaluate

Provides a wide range of evaluation metrics out of the box. Here are three examples (a usage sketch follows the list below):

  • Perplexity

    • Documentation
    • Perplexity is a measurement of how well a probability distribution or probability model predicts a sample
    • Intuitively, perplexity can be understood as a measure of uncertainty: the perplexity of a language model reflects how uncertain it is when predicting the next word in a sequence. Good read: Understanding evaluation metrics for language models
    • Practically, the calculation of perplexity depends on the context length of the LLM; HF Transformers provides an example of how to handle this with a "sliding window" approach (sketched after this list).
  • BLEU

    • Bilingual Evaluation Understudy is a metric computed by comparing machine-generated translations against human reference translations
    • Requires human translation references
  • ROUGE

    • Recall-Oriented Understudy for Gisting Evaluation is a metric computed by comparing machine-generated summarisations against human reference summarisations
    • Requires human summarisation references
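
As a quick illustration, the sketch below loads all three metrics through the `evaluate` library and computes them on toy inputs. The model ID, predictions, and references are placeholders, and the exact keys in the returned dictionaries may vary between library versions.

```python
# Minimal sketch: computing perplexity, BLEU and ROUGE with HF Evaluate.
# Model ID, predictions and references are placeholders, not real evaluation data.
import evaluate

# Perplexity: how well a causal LM predicts the given texts (lower is better).
perplexity = evaluate.load("perplexity", module_type="metric")
ppl = perplexity.compute(
    model_id="gpt2",  # placeholder causal LM
    predictions=["The cat sat on the mat.", "Paris is the capital of France."],
)
print(ppl["mean_perplexity"])

# BLEU: n-gram overlap between candidate translations and human references.
bleu = evaluate.load("bleu")
bleu_score = bleu.compute(
    predictions=["the cat is on the mat"],
    references=[["the cat sat on the mat"]],  # one or more references per prediction
)
print(bleu_score["bleu"])

# ROUGE: recall-oriented overlap between candidate summaries and human references.
rouge = evaluate.load("rouge")
rouge_score = rouge.compute(
    predictions=["the quick brown fox jumped over the lazy dog"],
    references=["the quick brown fox jumps over the lazy dog"],
)
print(rouge_score["rougeL"])
```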
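
For long texts, the "sliding window" perplexity calculation mentioned above can be sketched roughly as follows, loosely following the HF Transformers guide; the model ID, text, and stride are placeholders, and the small off-by-one from the internal label shift at the first window is ignored for brevity.

```python
# Rough sketch of sliding-window perplexity for texts longer than the context window.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "A long evaluation document would go here..."
encodings = tokenizer(text, return_tensors="pt")

max_length = model.config.n_positions  # context window (attribute name varies by model)
stride = 512                           # how far the window advances each step
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens not already scored by an earlier window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask tokens that only provide left context
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)
    prev_end = end
    if end == seq_len:
        break

perplexity = torch.exp(torch.stack(nlls).mean())  # exponentiated average NLL
print(f"Perplexity: {perplexity.item():.2f}")
```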

6.3 Further reading

  • A Metrics-First Approach to LLM Evaluation

  • Semantic Uncertainty

    Introduces semantic entropy, a measure which incorporates linguistic invariances created by shared meanings to provide a more predictive metric of model accuracy.

  • FEVER

    Introduces Fact Extraction and VERification, a publicly available dataset for verification against textual sources.

  • ProoFVer

    Proposes a fact verification system which uses a seq2seq model to generate natural logic-based inferences as proofs.

6.4 Challenges

  • Subjective human evaluation
  • Over-reliance on perplexity
  • Difficult to capture diversity and creativity
  • Metric performance won't necessarily translate to real-world use-case performance
  • Dataset bias