---
title: 'Evaluating Language Models'
sidebar_label: 'Evals'
description: 'Learn how Spice evaluates, tracks, compares, and improves language model performance for specific tasks'
sidebar_position: 4
pagination_prev: null
pagination_next: null
tags:
---
Language models can perform complex tasks. Evals help measure a model's ability to perform a specific task. Evals are defined as Spicepod components and can evaluate any Spicepod model's performance.
Refer to the Cookbook for related examples.
In Spice, an eval consists of the following core components:

- A `name` identifying the eval.
- A `dataset` providing inputs and expected outputs.
- One or more `scorers` that grade the model's actual output against the expected output.
An eval component is defined as follows:
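A sketch of an eval definition in `spicepod.yaml`, using the fields described below (the `name` and `dataset` values are hypothetical; check the Spicepod reference for the exact schema):

```yaml
evals:
  - name: my_eval        # unique identifier for this eval
    dataset: eval_dataset  # a dataset component with the expected format
    scorers:
      - match            # one or more scoring methods
```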
Where:
- `name`: a unique identifier for this eval (like models, datasets, etc.).
- `dataset`: a dataset component.
- `scorers`: a list of scoring methods.

For complete details on the evals component, see the Spicepod reference.
To run an eval:
1. Define the eval component (and its associated dataset).
2. Start an eval via the HTTP API:
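As a sketch, assuming Spice's HTTP endpoint is served on the default local port and the eval API follows a `POST /v1/evals/{name}` shape (the route, port, and body fields here are assumptions; verify them against the Spice API reference):

```shell
# Hypothetical request: start the eval "my_eval" against the model "my_model".
curl -X POST http://localhost:8090/v1/evals/my_eval \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model"}'
```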
Depending on the dataset and model, the eval run can take some time to complete. On completion, results will be available in two tables:
- `eval.runs`: Summarises the status and scores from the eval run.
- `eval.results`: Contains the input, expected output, and actual output for each eval run, and the score from each scorer.

Datasets are used to define the input and expected output for an eval. Evals expect a particular format:
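Both tables can be queried with SQL once a run completes. A minimal sketch (no column names are assumed beyond the table names above):

```sql
-- Summary of eval runs, including status and scores
SELECT * FROM eval.runs;

-- Per-case detail: input, expected output, actual output, and scorer scores
SELECT * FROM eval.results;
```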
- `input`: The input to the model. It should be either:
  - A plain string (e.g. `"Hello, how are you?"`), interpreted as a single user message.
  - A JSON array of chat messages (e.g. `[{"role":"system","content":"You are a helpful assistant."}, ...]`).
- `ideal`: The expected output. It should be either:
  - A plain string (e.g. `"I'm doing well, thanks!"`), interpreted as a single assistant response.
  - A JSON array of completion choices (e.g. `[{"index":0,"message":{"role":"assistant","content":"Sure!"}}, ...]`).

To use a dataset with a different format, use a view. For example:
An eval scorer is a method to score the model's performance on a single eval case. A scorer is given the input provided to the model, the model's actual output, and the expected output, and produces an associated score. Spice has several out-of-the-box scorers:
- `match`: Checks for an exact match between the expected and actual outputs.
- `json_match`: Checks whether the expected and actual outputs are equivalent JSON.
- `includes`: Checks whether the actual output includes the expected output.
- `fuzzy_match`: Checks whether a normalised version (ignoring casing, punctuation, articles such as "a" and "the", and excess whitespace) of either the expected or actual output is a subset of the other.
- `levenshtein`: Computes the Levenshtein distance between the two output strings, normalised to the string length. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

Spice has two other methods to define new scorers based on other Spicepod components:
- Any `embeddings` model defined in the `spicepod.yaml` is automatically available as a scorer.
- Any `models` model defined in the `spicepod.yaml` is automatically available as a scorer. Note, however, that these models should generally be configured specifically to act as a judge, and there are constraints the model must satisfy; see below.

Below is an example of an eval that uses all three: a builtin scorer, an embedding model scorer, and an LLM judge.
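A sketch of such an eval (the scorer entries `my_embedding_model` and `judge_model` are hypothetical names for an embeddings model and a judge model assumed to be defined elsewhere in the same `spicepod.yaml`):

```yaml
evals:
  - name: qa_eval
    dataset: eval_dataset
    scorers:
      - match               # builtin scorer
      - my_embedding_model  # an `embeddings` model defined in this spicepod
      - judge_model         # a `models` model configured as an LLM judge
```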
Spicepod models can be used to provide eval scores for other models. To do so in Spice, the LLM must:
- Return an output that includes `.score`.
- Accept the template variables `input`, `actual` & `ideal`. The type of these variables will depend on the dataset, as per the dataset format.
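As an illustrative sketch only (the `from` value, the `system_prompt` parameter, and the prompt wording are assumptions, not the required configuration; consult the models reference for how judges must be set up):

```yaml
models:
  - name: judge_model
    from: openai:gpt-4o-mini  # hypothetical provider/model reference
    params:
      system_prompt: |
        You are an eval judge. Compare the actual output to the ideal output
        for the given input, and reply with JSON containing a numeric "score".
```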