# Welcome to EvalSense
EvalSense is a tool for systematic evaluation of large language models (LLMs) on open-ended generation tasks, with a particular focus on bespoke, domain-specific evaluations. Some of its key features include:
- **Broad model support.** Out-of-the-box compatibility with a wide range of local and API-based model providers, including Ollama, Hugging Face, vLLM, OpenAI, Anthropic, and others.
- **Evaluation guidance.** An interactive evaluation guide and automated meta-evaluation tools assist in selecting the most appropriate evaluation methods for a specific use case, including the use of perturbed data to assess method effectiveness.
- **Interactive UI.** A web-based interface enables rapid experimentation with different evaluation workflows without requiring any code.
- **Advanced evaluation methods.** EvalSense incorporates recent LLM-as-a-Judge and hybrid evaluation approaches, such as G-Eval and QAGS, while also supporting more traditional metrics like BERTScore and ROUGE.
- **Efficient execution.** Intelligent experiment scheduling and resource management minimise computational overhead for local models. For remote APIs, EvalSense uses asynchronous parallel calls to maximise throughput.
- **Modularity and extensibility.** Key components and evaluation methods can be used independently or replaced with user-defined implementations.
- **Comprehensive logging.** All key aspects of evaluation are recorded in machine-readable logs, including model parameters, prompts, model outputs, evaluation results, and other metadata.
## Quick Start

### Installation
You can install the project using pip by running the following command:
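```bash
# Install the latest EvalSense release from PyPI
pip install evalsense
```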
This will install the latest released version of the package from PyPI without any optional dependencies.
Depending on your use case, you may want to install additional dependencies from the following groups:
- `webui`: For using the interactive web UI.
- `jupyter`: For running experiments in Jupyter notebooks (only needed if you don't already have the necessary libraries installed).
- `transformers`: For using models and metrics requiring the Hugging Face Transformers library.
- `vllm`: For using models and metrics requiring vLLM.
- `interactive`: For using EvalSense with interactive UI features (currently includes `webui` and `jupyter`).
- `local`: For installing all local model dependencies (currently includes `transformers` and `vllm`).
- `all`: For installing all optional dependencies.
For example, if you want to install EvalSense with all optional dependencies, you can run:
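```bash
pip install "evalsense[all]"
```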
If you want to use EvalSense with interactive features (`interactive`) and Hugging Face Transformers (`transformers`), you can run:
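```bash
pip install "evalsense[interactive,transformers]"
```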
and similarly for other combinations.
### Programmatic Usage
For examples illustrating the usage of EvalSense, please check the notebooks under the `notebooks/` folder:
- The Demo notebook illustrates a basic application of EvalSense to the ACI-Bench dataset.
- The Experiments notebook illustrates more thorough experiments on the same dataset, involving a larger number of evaluators and models.
- The Meta-Evaluation notebook focuses on meta-evaluation using synthetically perturbed data, where the goal is to identify the most reliable evaluation methods rather than the best-performing models.
### Web-Based UI
To use the interactive web-based UI implemented in EvalSense, simply run the UI launch command after installing the package and its dependencies. Note that you need to install EvalSense with the `webui` extra (`pip install "evalsense[webui]"`) or an extra that includes it before running this command.
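As a sketch, assuming the package exposes a console script for the UI (the name `evalsense-webui` below is a placeholder; check the project documentation for the actual command):

```bash
# Hypothetical entry point; the actual launch command may differ
evalsense-webui
```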
## Acknowledgements
We thank the Inspect AI development team for their work on the Inspect AI library, which serves as the foundation for EvalSense.