5. Evaluating generated outputs
How do we evaluate an open source LLM? The table below lists a (non-exhaustive) set of methods for evaluating an open source LLM.
Some of these projects, such as lm-evaluation-harness, provide extensive evaluation tools; however, they can be cumbersome to set up and computationally expensive to run locally due to the vast number of requests each evaluation task sends to the LLM.
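As a rough illustration of what a local run involves, the sketch below evaluates a small model on a single task with lm-evaluation-harness. It assumes a recent 0.4.x release installed via `pip install lm-eval`; the model and task names are placeholders chosen to keep the run cheap, not recommendations.

```python
# Minimal lm-evaluation-harness sketch: one small model, one task.
# Assumes `pip install lm-eval` (0.4.x) and a working transformers install.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder: a small model
    tasks=["hellaswag"],                             # placeholder: a single task
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) live under the "results" key.
print(results["results"]["hellaswag"])
```

Even a single task like this issues a large number of requests to the model, which is why full suites quickly become expensive to run on local hardware.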
Projects

| Type | Notes |
| --- | --- |
| Benchmark, LLM as a judge | Uses MT-bench, a set of challenging multi-turn open-ended questions, to evaluate models. To automate the evaluation process, FastChat prompts strong LLMs such as GPT-4 to act as judges and assess the quality of responses. |
| Benchmark | An LLM-based automatic evaluation that is validated against human annotations. Measures the fraction of times a powerful LLM (e.g. GPT-4, Claude or ChatGPT) prefers the outputs of the evaluated LLM over the outputs of a reference LLM. |
| Multiple choice tests | A set of 57 tasks that assess an LLM's general knowledge and problem-solving ability. |
| Custom | A tool for testing and evaluating LLM output quality; test cases are defined to score LLM outputs. |
| Application | An open source LLMOps platform for prompt engineering, evaluation, human feedback, and deployment of complex LLM apps. Provides a GUI for iterating on versions, with multiple evaluation methods and metrics available out of the box. |
| Benchmark, LLM as a judge | Self-hosted tools for experimenting with, testing, and evaluating LLMs, including evaluation tooling. Supports multiple LLMs, vector databases, frameworks and Stable Diffusion. |
| Framework | A framework for evaluating LLMs, or systems built using LLMs as components. Includes an open source registry of challenging evals, where an "eval" is a task used to evaluate the quality of a system's behaviour. Requires an OpenAI API key. |
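To make the LLM-as-a-judge rows above concrete, here is a minimal sketch of pairwise judging in the spirit of MT-bench and the reference-comparison benchmark, but heavily simplified. It assumes the openai Python package (v1 client) with OPENAI_API_KEY set; the judge model, prompt, toy evaluation set and verdict parsing are illustrative assumptions rather than the actual implementations of those projects.

```python
# Simplified LLM-as-a-judge sketch: ask a strong LLM which of two answers it
# prefers, then report the win rate of "our" model against a reference model.
# Assumes `pip install openai` (v1 client) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" if answer A is better, "B" if answer B is better, or "TIE".

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Return the judge model's verdict: "A", "B" or "TIE"."""
    response = client.chat.completions.create(
        model="gpt-4",  # the "strong LLM" acting as judge
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return response.choices[0].message.content.strip()


# Toy evaluation set: (question, our model's output, reference model's output).
eval_set = [
    ("What causes tides?", "Mainly the Moon's gravitational pull on the oceans.", "The wind."),
]

wins = sum(judge(q, ours, ref) == "A" for q, ours, ref in eval_set)
print(f"Win rate vs reference: {wins / len(eval_set):.2f}")
```

In practice these projects use more elaborate judging prompts, swap the order of the two answers to control for position bias, and average over many questions; the sketch only shows the core loop.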
When using pre-trained models, it may be more effective to review performance benchmarks and metrics that have already been published, for example the chatbot-arena-leaderboard on Hugging Face.
If we start fine-tuning models, it may be worth considering a more efficient platform for running evaluation tasks.