---
title: 'Evaluating Language Models'
sidebar_label: 'Evals'
description: 'Learn how Spice evaluates, tracks, compares, and improves language model performance for specific tasks'
sidebar_position: 4
pagination_prev: null
pagination_next: null
tags:
---
Language models can perform complex tasks. Evals help measure a model's ability to perform a specific task. Evals are defined as Spicepod components and can evaluate any Spicepod model's performance.
Refer to the Cookbook for related examples.
In Spice, an eval consists of the following core components:

- A `name` identifying the eval.
- A `dataset` providing inputs and expected outputs.
- One or more `scorers` that grade the model's actual output against the expected output.
An eval component is defined as follows:
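A sketch of an eval definition in `spicepod.yaml`, using the fields described below (the `name` and `dataset` values are hypothetical; check the Spicepod reference for the exact schema):

```yaml
evals:
  - name: my_eval        # unique identifier for this eval
    dataset: eval_dataset  # a dataset component with the expected format
    scorers:
      - match            # one or more scoring methods
```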
Where:
- `name`: a unique identifier for this eval (like models, datasets, etc.).
- `dataset`: a dataset component.
- `scorers`: a list of scoring methods.

For complete details on the evals component, see the Spicepod reference.
To run an eval:
1. Define the eval component (and its associated dataset).
2. Start an eval via the HTTP API:
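As a sketch, assuming Spice's HTTP endpoint is served on the default local port and the eval API follows a `POST /v1/evals/{name}` shape (the route, port, and body fields here are assumptions; verify them against the Spice API reference):

```shell
# Hypothetical request: start the eval "my_eval" against the model "my_model".
curl -X POST http://localhost:8090/v1/evals/my_eval \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model"}'
```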
Depending on the dataset and model, the eval run can take some time to complete. On completion, results will be available in two tables:
- `eval.runs`: Summarises the status and scores from the eval run.
- `eval.results`: Contains the input, expected output, and actual output for each eval run, and the score from each scorer.

Datasets are used to define the input and expected output for an eval. Evals expect a particular format:
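Both tables can be queried with SQL once a run completes. A minimal sketch (no column names are assumed beyond the table names above):

```sql
-- Summary of eval runs, including status and scores
SELECT * FROM eval.runs;

-- Per-case detail: input, expected output, actual output, and scorer scores
SELECT * FROM eval.results;
```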
- `input`: The input to the model. It should be either:
  - A plain string (e.g. `"Hello, how are you?"`), interpreted as a single user message.
  - A JSON array of chat messages (e.g. `[{"role":"system","content":"You are a helpful assistant."}, ...]`).
- `ideal`: The expected output. It should be either:
  - A plain string (e.g. `"I'm doing well, thanks!"`), interpreted as a single assistant response.
  - A JSON array of completion choices (e.g. `[{"index":0,"message":{"role":"assistant","content":"Sure!"}}, ...]`).

To use a dataset with a different format, use a view. For example:
An eval scorer is a method to score the model's performance on a single eval case. A scorer is given the input provided to the model, the model's actual output, and the expected output, and produces an associated score. Spice has several out-of-the-box scorers:
- `match`: Checks for an exact match between the expected and actual outputs.
- `json_match`: Checks whether the expected and actual outputs are equivalent JSON.
- `includes`: Checks whether the actual output includes the expected output.
- `fuzzy_match`: Checks whether a normalised version (ignoring casing, punctuation, articles such as "a" and "the", and excess whitespace) of either the expected or actual output is a subset of the other.
- `levenshtein`: Computes the Levenshtein distance between the two output strings, normalised to the string length. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

Spice has two other methods to define new scorers based on other Spicepod components:
- Any `embeddings` model defined in the `spicepod.yaml` is automatically available as a scorer.
- Any `models` model defined in the `spicepod.yaml` is automatically available as a scorer. Note, however, that these models should generally be configured specifically to act as a judge, and there are constraints the model must satisfy; see below.

Below is an example of an eval that uses all three: a builtin scorer, an embedding model scorer, and an LLM judge.
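A sketch of such an eval (the scorer entries `my_embedding_model` and `judge_model` are hypothetical names for an embeddings model and a judge model assumed to be defined elsewhere in the same `spicepod.yaml`):

```yaml
evals:
  - name: qa_eval
    dataset: eval_dataset
    scorers:
      - match               # builtin scorer
      - my_embedding_model  # an `embeddings` model defined in this spicepod
      - judge_model         # a `models` model configured as an LLM judge
```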
Spicepod models can be used to provide eval scores for other models. To do so in Spice, the LLM must:
- Return an output that includes `.score`.
- Accept the template variables `input`, `actual` & `ideal`. The type of these variables will depend on the dataset, as per the dataset format.
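As an illustrative sketch only (the `from` value, the `system_prompt` parameter, and the prompt wording are assumptions, not the required configuration; consult the models reference for how judges must be set up):

```yaml
models:
  - name: judge_model
    from: openai:gpt-4o-mini  # hypothetical provider/model reference
    params:
      system_prompt: |
        You are an eval judge. Compare the actual output to the ideal output
        for the given input, and reply with JSON containing a numeric "score".
```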