Spice.ai Demo App
This is a Spice.ai data and AI app.
Prerequisites
- Spice.ai CLI installed
- OpenAI API key
- Hugging Face API token (optional, for the LLaMA model)
- curl and jq for API calls
Learn More
To learn more about Spice.ai, connect with us on Discord - your feedback is appreciated!
Demo Steps
Publishing a Spice App in the Cloud
Step 1: Forking and Using the Dataset
- Fork the repository https://github.com/jeadie/evals into your GitHub org.
Step 2: Creating a New App in the Cloud
- Log into the Spice.ai Cloud Platform and create a new app called evals. The app will start empty.
- Connect the app to your repository:
- Go to the App Settings tab and select Connect Repository.
- If the repository is not yet linked, follow the prompts to authenticate and link it.
Step 3: Deploying the App
- Set the app to Public:
- Navigate to the app's settings and toggle the visibility to public.
- Redeploy the app:
- Click Redeploy to load the datasets and configurations from the repository.
Step 4: Verifying and Testing
- Check the datasets in the Spice.ai Cloud:
- Verify that the datasets are correctly loaded and accessible.
- Test public access:
- Log in with a different account to confirm the app is accessible to external users.
Initializing a Local Spice App
- Initialize a new local Spice app:
mkdir demo
cd demo
spice init
- Log in to Spice.ai Cloud:
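Using the CLI's standard login flow, which authenticates against Spice.ai Cloud:
spice login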
- Get the spicepod from Spicerack:
Navigate to spicerack.org, search for evals, click on /evals, click on Use this app, and copy the spice connect command. Paste the command into the terminal.
spice connect <username>/evals
The spicepod.yml should be updated to:
version: v1beta1
kind: Spicepod
name: demo
dependencies:
- Jeadie/evals
- Add a model to the spicepod:
models:
- name: gpt-4o
from: openai:gpt-4o
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
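The referenced secret needs to be available to the runtime. A minimal sketch, assuming the default env-based secret store picks up a .env file in the app directory (the key value is a placeholder):
SPICE_OPENAI_API_KEY=sk-...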
- Start spice:
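From the demo directory:
spice run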
- Run an eval:
curl -XPOST "http://localhost:8090/v1/evals/taxes" -H "Content-Type: application/json" -d '{
"model": "gpt-4o"
}' | jq
- Explore incorrect results:
SELECT
input,
output,
actual
FROM eval.results
WHERE value=0.0 LIMIT 5;
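Queries like this can be run from the SQL REPL that ships with the CLI:
spice sql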
Optional: Create an Eval to Use a Smaller Model
- Track the outputs of all AI model calls:
runtime:
task_history:
captured_output: truncated
- Define a new view and evaluation:
views:
- name: user_queries
sql: |
SELECT
json_get_json(input, 'messages') AS input,
json_get_str((captured_output -> 0), 'content') as ideal
FROM runtime.task_history
WHERE task='ai_completion'
- name: latest_eval_runs
sql: |
SELECT model, MAX(created_at) as latest_run
FROM eval.runs
GROUP BY model
- name: model_stats
sql: |
SELECT
r.model,
COUNT(*) as total_queries,
SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) as correct_answers,
AVG(res.value) as accuracy
FROM eval.runs r
JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
JOIN eval.results res ON res.run_id = r.id
GROUP BY r.model
evals:
- name: mimic-user-queries
description: |
Evaluates how well a model can copy the exact answers already returned to a user. Useful for testing if a smaller/cheaper model is sufficient.
dataset: user_queries
scorers:
- match
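Before running the new eval, it's worth confirming the view has rows; a quick check, assuming at least one ai_completion call has already been captured in task history:
SELECT input, ideal FROM user_queries LIMIT 5;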
- Add a smaller model to the spicepod:
models:
- name: llama3
from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
params:
hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }
- name: gpt-4o # Keep previous model.
- Verify models are loaded:
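Assuming the CLI's spice models command, which lists the models registered with the runtime:
spice models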
You should see both models listed:
NAME FROM STATUS
gpt-4o openai:gpt-4o ready
llama3 huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct ready
- Restart the Spice app:
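Stop the runtime (Ctrl+C) and start it again:
spice run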
- Test the larger model or run another eval:
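For a quick smoke test, assuming the runtime's OpenAI-compatible chat endpoint is served on the same port as the evals API:
curl -XPOST "http://localhost:8090/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "What is a tax deduction?"}]}' | jq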
- Run evaluations on both models:
# Run eval with GPT-4o
curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o"}' | jq
# Run eval with LLaMA
curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
-H "Content-Type: application/json" \
-d '{"model": "llama3"}' | jq
- Compare model performance:
SELECT
model,
total_queries,
correct_answers,
ROUND(accuracy * 100, 2) as accuracy_percentage
FROM model_stats
ORDER BY accuracy_percentage DESC;
This query shows, for each model:
- Total number of queries processed
- Number of correct answers
- Accuracy as a percentage
You can use these metrics to decide if the smaller model provides acceptable performance for your use case.
Full Spicepod Configuration
The complete spicepod.yml, for reference:
version: v1beta1
kind: Spicepod
name: demo
dependencies:
- Jeadie/evals
runtime:
task_history:
captured_output: truncated
views:
- name: user_queries
sql: |
SELECT
json_get_json(input, 'messages') AS input,
json_get_str((captured_output -> 0), 'content') as ideal
FROM runtime.task_history
WHERE task='ai_completion'
evals:
- name: mimic-user-queries
description: |
Evaluates how well a model can copy the exact answers already returned to a user. Useful for testing if a smaller/cheaper model is sufficient.
dataset: user_queries
scorers:
- match
models:
- name: gpt-4o
from: openai:gpt-4o
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- name: llama3
from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
params:
hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }