# Welcome to EvalSense
EvalSense is a tool for systematic evaluation of large language models (LLMs) on open-ended generation tasks, with a particular focus on bespoke, domain-specific evaluations. Some of its key features include:
- **Broad model support.** Out-of-the-box compatibility with a wide range of local and API-based model providers, including Ollama, Hugging Face, vLLM, OpenAI, Anthropic, and others.
- **Evaluation guidance.** An interactive evaluation guide and automated meta-evaluation tools assist in selecting the most appropriate evaluation methods for a specific use case, including the use of perturbed data to assess method effectiveness.
- **Interactive UI.** A web-based interface enables rapid experimentation with different evaluation workflows without requiring any code.
- **Advanced evaluation methods.** EvalSense incorporates recent LLM-as-a-Judge and hybrid evaluation approaches, such as G-Eval and QAGS, while also supporting more traditional metrics like BERTScore and ROUGE.
- **Efficient execution.** Intelligent experiment scheduling and resource management minimise computational overhead for local models. For remote APIs, EvalSense uses asynchronous parallel calls to maximise throughput.
- **Modularity and extensibility.** Key components and evaluation methods can be used independently or replaced with user-defined implementations.
- **Comprehensive logging.** All key aspects of evaluation are recorded in machine-readable logs, including model parameters, prompts, model outputs, evaluation results, and other metadata.
## Quick Start

### Installation
You can install the project using pip by running the following command:
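```bash
# Install the latest EvalSense release from PyPI
pip install evalsense
```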
This will install the latest released version of the package from PyPI without any optional dependencies.
Depending on your use case, you may want to install additional dependencies from the following groups:
- `webui`: For using the interactive web UI.
- `jupyter`: For running experiments in Jupyter notebooks (only needed if you don't already have the necessary libraries installed).
- `transformers`: For using models and metrics requiring the Hugging Face Transformers library.
- `vllm`: For using models and metrics requiring vLLM.
- `interactive`: For using EvalSense with interactive UI features (currently includes `webui` and `jupyter`).
- `local`: For installing all local model dependencies (currently includes `transformers` and `vllm`).
- `all`: For installing all optional dependencies.
For example, if you want to install EvalSense with all optional dependencies, you can run:
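```bash
pip install "evalsense[all]"
```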
If you want to use EvalSense with interactive features (`interactive`) and Hugging Face Transformers (`transformers`), you can run:
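```bash
pip install "evalsense[interactive,transformers]"
```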
and similarly for other combinations.
### Programmatic Usage
For examples illustrating the usage of EvalSense, please check the notebooks under the `notebooks/` folder:
- The Demo notebook illustrates a basic application of EvalSense to the ACI-Bench dataset.
- The Experiments notebook illustrates more thorough experiments on the same dataset, involving a larger number of evaluators and models.
- The Meta-Evaluation notebook focuses on meta-evaluation using synthetically perturbed data, where the goal is to identify the most reliable evaluation methods rather than the best-performing models.
### Web-Based UI
To use the interactive web-based UI implemented in EvalSense, simply run the UI launch command after installing the package and its dependencies. Note that you need to install EvalSense with the `webui` extra (`pip install "evalsense[webui]"`) or an extra that includes it before running this command.
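As a sketch, assuming the package exposes a console script for the UI (the name `evalsense-webui` below is a placeholder; check the project documentation for the actual command):

```bash
# Hypothetical entry point; the actual launch command may differ
evalsense-webui
```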
## Acknowledgements
We thank the Inspect AI development team for their work on the Inspect AI library, which serves as the foundation for EvalSense.