Welcome to EvalSense

EvalSense is a tool for systematic evaluation of large language models (LLMs) on open-ended generation tasks, with a particular focus on bespoke, domain-specific evaluations. Some of its key features include:

  • Broad model support. Out-of-the-box compatibility with a wide range of local and API-based model providers, including Ollama, Hugging Face, vLLM, OpenAI, Anthropic, and others.
  • Evaluation guidance. An interactive evaluation guide and automated meta-evaluation tools assist in selecting the most appropriate evaluation methods for a specific use case, including the use of perturbed data to assess method effectiveness.
  • Interactive UI. A web-based interface enables rapid experimentation with different evaluation workflows without requiring any code.
  • Advanced evaluation methods. EvalSense incorporates recent LLM-as-a-Judge and hybrid evaluation approaches, such as G-Eval and QAGS, while also supporting more traditional metrics like BERTScore and ROUGE (see the sketch after this list).
  • Efficient execution. Intelligent experiment scheduling and resource management minimise computational overhead for local models. For remote APIs, EvalSense uses asynchronous parallel calls to maximise throughput.
  • Modularity and extensibility. Key components and evaluation methods can be used independently or replaced with user-defined implementations.
  • Comprehensive logging. All key aspects of evaluation are recorded in machine-readable logs, including model parameters, prompts, model outputs, evaluation results, and other metadata.
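
To make the metrics point concrete, here is a minimal sketch of what a traditional reference-based metric like ROUGE computes. It uses the standalone rouge-score package rather than EvalSense's own API (the EvalSense wrappers are covered in the notebooks referenced below), and the example texts are purely illustrative:

# Illustration only: this uses the third-party rouge-score package
# (pip install rouge-score), not an EvalSense interface.
from rouge_score import rouge_scorer

# Compare a generated summary against a reference using ROUGE-1 and ROUGE-L.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The patient was discharged in stable condition.",
    prediction="The patient left the hospital in stable condition.",
)
print(scores["rougeL"].fmeasure)  # F1 based on the longest common subsequence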

Quick Start

Installation

You can install EvalSense using pip by running the following command:

pip install evalsense

This will install the latest released version of the package from PyPI without any optional dependencies.

Depending on your use case, you may want to install additional dependencies from the following groups:

  • webui: For using the interactive web UI.
  • jupyter: For running experiments in Jupyter notebooks (only needed if you don't already have the necessary libraries installed).
  • transformers: For using models and metrics requiring the Hugging Face Transformers library.
  • vllm: For using models and metrics requiring vLLM.
  • interactive: For using EvalSense with interactive UI features (currently includes webui and jupyter).
  • local: For installing all local model dependencies (currently includes transformers and vllm).
  • all: For installing all optional dependencies.

For example, if you want to install EvalSense with all optional dependencies, you can run:

pip install "evalsense[all]"

If you want to use EvalSense with interactive features (interactive) and Hugging Face Transformers (transformers), you can run:

pip install "evalsense[interactive,transformers]"

and similarly for other combinations.

Programmatic Usage

For examples illustrating the usage of EvalSense, please check the notebooks under the notebooks/ folder:

  • The Demo notebook illustrates a basic application of EvalSense to the ACI-Bench dataset.
  • The Experiments notebook presents more extensive experiments on the same dataset, involving a larger number of evaluators and models.
  • The Meta-Evaluation notebook focuses on meta-evaluation on synthetically perturbed data, where the goal is to identify the most reliable evaluation methods rather than the best-performing models.
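
EvalSense builds on Inspect AI, so its evaluation pipelines are ultimately expressed in terms of Inspect AI tasks, solvers, and scorers. As a rough orientation, a minimal evaluation written against plain Inspect AI primitives (not EvalSense's own classes; see the notebooks above for those) might look like the sketch below. The model identifier is illustrative, and running it assumes the corresponding API key is configured:

# A minimal plain Inspect AI evaluation, shown for orientation only;
# EvalSense provides higher-level abstractions on top of these primitives.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate

@task
def toy_eval():
    # A single-sample dataset; real evaluations would load a full dataset.
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),         # generate a model response
        scorer=model_graded_qa(),  # LLM-as-a-Judge style grading
    )

# Illustrative model identifier; any Inspect-supported provider works.
eval(toy_eval(), model="openai/gpt-4o")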

Web-Based UI

To use the interactive web-based UI implemented in EvalSense, simply run

evalsense webui

after installing the package. Note that the command is only available if you installed EvalSense with the webui extra (pip install "evalsense[webui]") or an extra that includes it.

Acknowledgements

We thank the Inspect AI development team for their work on the Inspect AI library, which serves as the basis for EvalSense.