
Spice.ai Demo App

This is a Spice.ai data and AI app.

Prerequisites

  • Spice.ai CLI installed
  • OpenAI API key
  • Hugging Face API token (optional, for LLaMA model)
  • curl and jq for API calls
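The spicepod used later in this demo reads these keys through Spice secrets (for example ${ secrets:SPICE_OPENAI_API_KEY }). A minimal sketch, assuming the default environment-variable secret store, is to export the matching variables before starting Spice:

    # Assumption: the env secret store resolves these variable names to the
    # ${ secrets:SPICE_OPENAI_API_KEY } / ${ secrets:SPICE_HUGGINGFACE_API_KEY }
    # references used in spicepod.yml later in this demo.
    export SPICE_OPENAI_API_KEY="sk-..."        # OpenAI API key
    export SPICE_HUGGINGFACE_API_KEY="hf_..."   # Hugging Face token (optional, for the LLaMA model)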

Learn More

To learn more about Spice.ai, explore the Spice.ai documentation and connect with us on Discord - your feedback is appreciated!


Demo Steps

Publishing a Spice App in the Cloud

Step 1: Forking and Using the Dataset

  1. Fork the repository https://github.com/jeadie/evals into your GitHub org.
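If you use the GitHub CLI, one way to do this is sketched below; my-org is a placeholder for your GitHub organization name:

    # Fork jeadie/evals into your org and clone it locally (assumes the gh CLI is installed and authenticated).
    gh repo fork jeadie/evals --org my-org --clone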

Step 2: Creating a New App in the Cloud

  1. Log into the Spice.ai Cloud Platform and create a new app called evals. The app will start empty.
  2. Connect the app to your repository:
    • Go to the App Settings tab and select Connect Repository.
    • If the repository is not yet linked, follow the prompts to authenticate and link it.

Step 3: Deploying the App

  1. Set the app to Public:
    • Navigate to the app's settings and toggle the visibility to public.
  2. Redeploy the app:
    • Click Redeploy to load the datasets and configurations from the repository.

Step 4: Verifying and Testing

  1. Check the datasets in the Spice.ai Cloud:
    • Verify that the datasets are correctly loaded and accessible.
  2. Test public access:
    • Log in with a different account to confirm the app is accessible to external users.

Initializing a Local Spice App

  1. Initialize a new local Spice app

    mkdir demo
    cd demo
    spice init
  2. Log in to Spice.ai Cloud

    spice login
  3. Get the spicepod from Spicerack. Navigate to spicerack.org and search for evals. Click on /evals, click Use this app, and copy the spice connect command. Paste the command into the terminal:

    spice connect <username>/evals

    The spicepod.yml should be updated to:

    version: v1beta1
    kind: Spicepod
    name: demo

    dependencies:
    - Jeadie/evals
  4. Add a model to the spicepod

    models:
    - name: gpt-4o
      from: openai:gpt-4o
      params:
        openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
  5. Start Spice

    spice run
  6. Run an eval

    curl -XPOST "http://localhost:8090/v1/evals/taxes" -H "Content-Type: application/json" -d '{
      "model": "gpt-4o"
    }' | jq
  7. Explore incorrect results (a scripted version of this step follows the list)

    spice sql

    SELECT
      input,
      output,
      actual
    FROM eval.results
    WHERE value=0.0 LIMIT 5;
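If you prefer to script steps 6 and 7 rather than use the interactive REPL, a minimal sketch is below. It assumes the runtime also exposes its SQL API over HTTP at /v1/sql on the same port as the evals endpoint; adjust if your runtime is configured differently.

    # Run the eval, then pull a few incorrect results without opening spice sql.
    curl -XPOST "http://localhost:8090/v1/evals/taxes" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o"}' | jq

    curl -XPOST "http://localhost:8090/v1/sql" \
      -H "Content-Type: text/plain" \
      --data "SELECT input, output, actual FROM eval.results WHERE value=0.0 LIMIT 5" | jq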

Optional: Create an Eval to Use a Smaller Model

  1. Track the outputs of all AI model calls:

    runtime:
      task_history:
        captured_output: truncated
  2. Define new views and an evaluation:

    views:
    - name: user_queries
      sql: |
        SELECT
          json_get_json(input, 'messages') AS input,
          json_get_str((captured_output -> 0), 'content') as ideal
        FROM runtime.task_history
        WHERE task='ai_completion'
    - name: latest_eval_runs
      sql: |
        SELECT model, MAX(created_at) as latest_run
        FROM eval.runs
        GROUP BY model
    - name: model_stats
      sql: |
        SELECT
          r.model,
          COUNT(*) as total_queries,
          SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) as correct_answers,
          AVG(res.value) as accuracy
        FROM eval.runs r
        JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
        JOIN eval.results res ON res.run_id = r.id
        GROUP BY r.model

    evals:
    - name: mimic-user-queries
      description: |
        Evaluates how well a model can copy the exact answers already returned to a user. Useful for testing if a smaller/cheaper model is sufficient.
      dataset: user_queries
      scorers:
      - match
  3. Add a smaller model to the spicepod:

    models:
    - name: llama3
      from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
      params:
        hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }

    - name: gpt-4o # Keep previous model.
  4. Verify models are loaded:

    spice models

    You should see both models listed:

    NAME    FROM                                                           STATUS
    gpt-4o  openai:gpt-4o                                                  ready
    llama3  huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct   ready
  5. Restart the Spice app:

    spice run
  6. Test the new model interactively, or run another eval:

    spice chat
  7. Run evaluations on both models:

    # Run eval with GPT-4o
    curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o"}' | jq

    # Run eval with LLaMA
    curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
      -H "Content-Type: application/json" \
      -d '{"model": "llama3"}' | jq
  8. Compare model performance (a scripted version of steps 7-8 follows this list):

    spice sql

    SELECT
      model,
      total_queries,
      correct_answers,
      ROUND(accuracy * 100, 2) as accuracy_percentage
    FROM model_stats
    ORDER BY accuracy_percentage DESC;

    This query will show:

    • Total number of queries processed
    • Number of correct answers
    • Accuracy as a percentage

    You can use these metrics to decide if the smaller model provides acceptable performance for your use case.
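For reference, steps 7 and 8 can also be scripted. The sketch below loops over both models, runs the mimic-user-queries eval for each, and then reads the model_stats view over the runtime's HTTP SQL endpoint (POST /v1/sql on the same port as the evals API is assumed to be available):

    # Run the eval for each model, then compare accuracy from the model_stats view.
    for MODEL in gpt-4o llama3; do
      curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"$MODEL\"}" | jq
    done

    curl -XPOST "http://localhost:8090/v1/sql" \
      -H "Content-Type: text/plain" \
      --data "SELECT model, total_queries, correct_answers, ROUND(accuracy * 100, 2) AS accuracy_percentage FROM model_stats ORDER BY accuracy_percentage DESC" | jq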


Full Spicepod Configuration

The full spicepod.yml is included below for reference:

version: v1beta1
kind: Spicepod
name: demo

dependencies:
- Jeadie/evals

runtime:
  task_history:
    captured_output: truncated

views:
- name: user_queries
  sql: |
    SELECT
      json_get_json(input, 'messages') AS input,
      json_get_str((captured_output -> 0), 'content') as ideal
    FROM runtime.task_history
    WHERE task='ai_completion'
- name: latest_eval_runs
  sql: |
    SELECT model, MAX(created_at) as latest_run
    FROM eval.runs
    GROUP BY model
- name: model_stats
  sql: |
    SELECT
      r.model,
      COUNT(*) as total_queries,
      SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) as correct_answers,
      AVG(res.value) as accuracy
    FROM eval.runs r
    JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
    JOIN eval.results res ON res.run_id = r.id
    GROUP BY r.model

evals:
- name: mimic-user-queries
  description: |
    Evaluates how well a model can copy the exact answers already returned to a user. Useful for testing if a smaller/cheaper model is sufficient.
  dataset: user_queries
  scorers:
  - match

models:
- name: gpt-4o
  from: openai:gpt-4o
  params:
    openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }

- name: llama3
  from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
  params:
    hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }