
Spice.ai Demo App

This is a Spice.ai data and AI app.

Prerequisites

  • Spice.ai CLI installed
  • OpenAI API key
  • Hugging Face API token (optional, for LLaMA model)
  • curl and jq for API calls
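The spicepod used later in this demo reads these keys through Spice secrets (for example ${ secrets:SPICE_OPENAI_API_KEY }). A minimal sketch, assuming the default environment-variable secret store, is to export the matching variables before starting Spice:

    # Assumption: the env secret store resolves these variable names to the
    # ${ secrets:SPICE_OPENAI_API_KEY } / ${ secrets:SPICE_HUGGINGFACE_API_KEY }
    # references used in spicepod.yml later in this demo.
    export SPICE_OPENAI_API_KEY="sk-..."        # OpenAI API key
    export SPICE_HUGGINGFACE_API_KEY="hf_..."   # Hugging Face token (optional, for the LLaMA model)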

Learn More

To learn more about Spice.ai, explore the Spice.ai documentation and connect with us on Discord - your feedback is appreciated!


Demo Steps

Publishing a Spice App in the Cloud

Step 1: Forking and Using the Dataset

  1. Fork the repository https://github.com/jeadie/evals into your GitHub org.
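If you use the GitHub CLI, one way to do this is sketched below; my-org is a placeholder for your GitHub organization name:

    # Fork jeadie/evals into your org and clone it locally (assumes the gh CLI is installed and authenticated).
    gh repo fork jeadie/evals --org my-org --clone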

Step 2: Creating a New App in the Cloud

  1. Log into the Spice.ai Cloud Platform and create a new app called evals. The app will start empty.
  2. Connect the app to your repository:
    • Go to the App Settings tab and select Connect Repository.
    • If the repository is not yet linked, follow the prompts to authenticate and link it.

Step 3: Deploying the App

  1. Set the app to Public:
    • Navigate to the app's settings and toggle the visibility to public.
  2. Redeploy the app:
    • Click Redeploy to load the datasets and configurations from the repository.

Step 4: Verifying and Testing

  1. Check the datasets in the Spice.ai Cloud:
    • Verify that the datasets are correctly loaded and accessible.
  2. Test public access:
    • Log in with a different account to confirm the app is accessible to external users.

Initializing a Local Spice App

  1. Initialize a new local Spice app

    mkdir demo
    cd demo
    spice init
  2. Log in to Spice.ai Cloud

    spice login
  3. Get the spicepod from Spicerack. Navigate to spicerack.org and search for evals. Click on /evals, click Use this app, and copy the spice connect command. Paste the command into the terminal:

    spice connect <username>/evals

    The spicepod.yml should be updated to:

    version: v1beta1
    kind: Spicepod
    name: demo

    dependencies:
    - Jeadie/evals
  4. Add a model to the spicepod

    models:
    - name: gpt-4o
      from: openai:gpt-4o
      params:
        openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
  5. Start Spice

    spice run
  6. Run an eval

    curl -XPOST "http://localhost:8090/v1/evals/taxes" -H "Content-Type: application/json" -d '{
      "model": "gpt-4o"
    }' | jq
  7. Explore incorrect results (a scripted version of this step follows the list)

    spice sql

    SELECT
      input,
      output,
      actual
    FROM eval.results
    WHERE value=0.0 LIMIT 5;
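If you prefer to script steps 6 and 7 rather than use the interactive REPL, a minimal sketch is below. It assumes the runtime also exposes its SQL API over HTTP at /v1/sql on the same port as the evals endpoint; adjust if your runtime is configured differently.

    # Run the eval, then pull a few incorrect results without opening spice sql.
    curl -XPOST "http://localhost:8090/v1/evals/taxes" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o"}' | jq

    curl -XPOST "http://localhost:8090/v1/sql" \
      -H "Content-Type: text/plain" \
      --data "SELECT input, output, actual FROM eval.results WHERE value=0.0 LIMIT 5" | jq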

Optional: Create an Eval to Use a Smaller Model

  1. Track the outputs of all AI model calls:

    runtime:
      task_history:
        captured_output: truncated
  2. Define new views and an evaluation:

    views:
    - name: user_queries
      sql: |
        SELECT
          json_get_json(input, 'messages') AS input,
          json_get_str((captured_output -> 0), 'content') as ideal
        FROM runtime.task_history
        WHERE task='ai_completion'
    - name: latest_eval_runs
      sql: |
        SELECT model, MAX(created_at) as latest_run
        FROM eval.runs
        GROUP BY model
    - name: model_stats
      sql: |
        SELECT
          r.model,
          COUNT(*) as total_queries,
          SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) as correct_answers,
          AVG(res.value) as accuracy
        FROM eval.runs r
        JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
        JOIN eval.results res ON res.run_id = r.id
        GROUP BY r.model

    evals:
    - name: mimic-user-queries
      description: |
        Evaluates how well a model can copy the exact answers already returned to a user. Useful for testing if a smaller/cheaper model is sufficient.
      dataset: user_queries
      scorers:
      - match
  3. Add a smaller model to the spicepod:

    models:
    - name: llama3
      from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
      params:
        hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }

    - name: gpt-4o # Keep previous model.
  4. Verify models are loaded:

    spice models

    You should see both models listed:

    NAME    FROM                                                           STATUS
    gpt-4o  openai:gpt-4o                                                  ready
    llama3  huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct   ready
  5. Restart the Spice app:

    spice run
  6. Test the new model interactively, or run another eval:

    spice chat
  7. Run evaluations on both models:

    # Run eval with GPT-4o
    curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o"}' | jq

    # Run eval with LLaMA
    curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
      -H "Content-Type: application/json" \
      -d '{"model": "llama3"}' | jq
  8. Compare model performance (a scripted version of steps 7-8 follows this list):

    spice sql

    SELECT
      model,
      total_queries,
      correct_answers,
      ROUND(accuracy * 100, 2) as accuracy_percentage
    FROM model_stats
    ORDER BY accuracy_percentage DESC;

    This query will show:

    • Total number of queries processed
    • Number of correct answers
    • Accuracy as a percentage

    You can use these metrics to decide if the smaller model provides acceptable performance for your use case.
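For reference, steps 7 and 8 can also be scripted. The sketch below loops over both models, runs the mimic-user-queries eval for each, and then reads the model_stats view over the runtime's HTTP SQL endpoint (POST /v1/sql on the same port as the evals API is assumed to be available):

    # Run the eval for each model, then compare accuracy from the model_stats view.
    for MODEL in gpt-4o llama3; do
      curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"$MODEL\"}" | jq
    done

    curl -XPOST "http://localhost:8090/v1/sql" \
      -H "Content-Type: text/plain" \
      --data "SELECT model, total_queries, correct_answers, ROUND(accuracy * 100, 2) AS accuracy_percentage FROM model_stats ORDER BY accuracy_percentage DESC" | jq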


Full Spicepod Configuration

The full spicepod.yml is included below for reference:

version: v1beta1
kind: Spicepod
name: demo

dependencies:
- Jeadie/evals

runtime:
  task_history:
    captured_output: truncated

views:
- name: user_queries
  sql: |
    SELECT
      json_get_json(input, 'messages') AS input,
      json_get_str((captured_output -> 0), 'content') as ideal
    FROM runtime.task_history
    WHERE task='ai_completion'
- name: latest_eval_runs
  sql: |
    SELECT model, MAX(created_at) as latest_run
    FROM eval.runs
    GROUP BY model
- name: model_stats
  sql: |
    SELECT
      r.model,
      COUNT(*) as total_queries,
      SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) as correct_answers,
      AVG(res.value) as accuracy
    FROM eval.runs r
    JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
    JOIN eval.results res ON res.run_id = r.id
    GROUP BY r.model

evals:
- name: mimic-user-queries
  description: |
    Evaluates how well a model can copy the exact answers already returned to a user. Useful for testing if a smaller/cheaper model is sufficient.
  dataset: user_queries
  scorers:
  - match

models:
- name: gpt-4o
  from: openai:gpt-4o
  params:
    openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }

- name: llama3
  from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
  params:
    hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }