5. Evaluating generated outputs

How do we evaluate an open source LLM? The following table contains a non-exhaustive list of projects and methods for evaluating an open source LLM.

Some of these projects, such as lm-evaluation-harness, provide extensive evaluation tools; however, they can be cumbersome to set up and computationally expensive to run locally because of the vast number of requests each evaluation task sends to the LLM.
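As a rough, non-authoritative sketch of what running one of these frameworks locally looks like: recent releases of lm-evaluation-harness expose a Python entry point (`simple_evaluate`) alongside their CLI. The model identifier, task name and argument names below are illustrative and may differ between versions.

```python
# Minimal sketch of running a single benchmark task locally with
# lm-evaluation-harness. Assumes `pip install lm-eval` and a recent release
# that exposes `simple_evaluate`; argument names can vary between versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any local path or Hub model id
    tasks=["hellaswag"],           # a single task can issue thousands of requests
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# Per-task metrics (accuracy, normalised accuracy, standard errors, ...)
print(results["results"])
```

Even a single task like this sends thousands of requests through the model, which is where the local compute cost comes from.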

| Type | Description | Link |
| --- | --- | --- |
| Framework | A unified framework to test LLMs on a large number of different evaluation tasks. | GitHub |
| Benchmark, LLM as a judge | Uses MT-bench, a set of challenging multi-turn open-ended questions, to evaluate models. To automate the evaluation process, FastChat prompts strong LLMs like GPT-4 to act as judges and assess the quality of responses (a generic sketch of this judging pattern follows the table). | GitHub |
| Benchmark | An LLM-based automatic evaluation that is validated against human annotations. Evaluates by measuring the fraction of times a powerful LLM (e.g. GPT-4, Claude or ChatGPT) prefers the outputs from an LLM over outputs from a reference LLM. | GitHub |
| Multiple choice tests | A set of 57 tasks to assess an LLM's general knowledge and problem-solving ability. | arXiv |
| Custom | A tool for testing and evaluating LLM output quality; define test cases to score LLM outputs. | GitHub |
| Application | An open source LLMOps platform for prompt engineering, evaluation, human feedback and deployment of complex LLM apps. Provides a GUI for iterating over versions, with multiple evaluation methods and metrics available out of the box. | GitHub |
| Benchmark, LLM as a judge | Self-hosted tools for experimenting with, testing and evaluating LLMs. Includes evaluation tools. Supports multiple LLMs, vector databases, frameworks and Stable Diffusion. | GitHub |
| Framework | A framework for evaluating LLMs, or systems built using LLMs as components. Includes an open source registry of challenging evals; an "eval" is a task used to evaluate the quality of a system's behaviour. Requires an OpenAI API key. | GitHub |
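Several of the entries above rely on the LLM-as-a-judge pattern: a strong model is shown a question and two candidate answers, asked to pick the better one, and the fraction of preferred answers becomes the score. The sketch below is a generic illustration of that pattern only, not any project's actual code; the judge prompt, the choice of GPT-4 as judge and the answer parsing are simplifying assumptions, and it uses the OpenAI Python client.

```python
# Generic illustration of the LLM-as-a-judge pattern used by MT-bench /
# AlpacaEval-style evaluations. Not any project's own code: the judge
# prompt, model choice and answer parsing here are simplifying assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one letter, A or B, naming the better answer."""


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask a strong model which of two candidate answers is better."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
        temperature=0,  # make the judging as deterministic as possible
    )
    return response.choices[0].message.content.strip()


def win_rate(examples: list[dict]) -> float:
    """Fraction of prompts where the judged model's answer ("A") is preferred
    over the reference model's answer ("B")."""
    wins = sum(
        judge(e["question"], e["model_answer"], e["reference_answer"]) == "A"
        for e in examples
    )
    return wins / len(examples)
```

In practice, judge evaluations of this kind also need to control for position bias (for example by swapping the order of the two answers), which the projects listed above handle for you.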

When using pre-trained models, it may be more effective to review performance benchmarks and metrics that have already been published, for example the chatbot-arena-leaderboard on Hugging Face.

If we start fine-tuning models, it may be worth considering a more efficient platform for running evaluation tasks.