title: 'Filesystem Model Deployment Guide' sidebar_label: 'Deployment Guide' description: 'Operating guide for filesystem-loaded models in production: formats, device selection, memory footprint, and observability.' sidebar_position: 10 pagination_prev: null pagination_next: null tags:
Production operating guide for loading local language models from the filesystem (GGUF, safetensors, ONNX).
The Filesystem model provider has no authentication layer. Access control is enforced by the operating system:
PersistentVolumeClaim or a model-serving sidecar.For sensitive models (proprietary weights, fine-tunes with PII training data), restrict filesystem ACLs to the Spice process user and encrypt the volume at rest.
The Filesystem model provider reads local files synchronously. There is no network layer, retry logic, or remote backoff. Failures surface as filesystem errors (ENOENT, EACCES, EIO).
Model loading happens once at startup. A missing or unreadable file fails the spicepod load; fix the underlying cause and restart.
| Format | Extension | Notes |
|---|---|---|
| GGUF | .gguf | Quantized / unquantized; loaded via the mistral local loader. |
| GGML (legacy) | .ggml | Legacy llama.cpp format. |
| Safetensors | .safetensors | Native tensor format; preferred over .bin for safety. |
| PyTorch | .bin / .pt / .pth | Legacy PyTorch checkpoints. |
| ONNX | .onnx | Supported via the tract runtime for classical ML models. |
Local inference uses the first available backend in order:
Install the CUDA-enabled Spice build on GPU hosts for materially better throughput on models over a few billion parameters.
Model file size on disk is close to RAM / VRAM footprint at load. Quantized GGUF models (Q4, Q5, Q8) reduce footprint roughly proportional to their bit-width. For a 7B-parameter model:
f16: ~14 GBQ8: ~7.5 GBQ5: ~5 GBQ4: ~4 GBAdd ~20–30% headroom for KV cache and working memory during inference.
The runtime rate limiter defaults to max_concurrency=1 for local models — local inference is compute-bound and benefits little from request-level parallelism on a single accelerator. Override via max_concurrency for multi-GPU / large-core CPU hosts.
Shared LLM metrics apply (see the OpenAI Model Deployment Guide for the full metric list): llm_requests, llm_failures, llm_internal_request_duration_ms, llm_prompt_tokens_total, llm_completion_tokens_total.
See Component Metrics for enabling and exporting metrics.
Local inference operations emit ai_completion spans (and health spans for probes) in task history, mirroring the shared model spans. captured_output and token usage fields are logged.
model_type can force a known architecture.| Symptom | Likely cause | Resolution |
|---|---|---|
No such file or directory | Path typo or missing mount. | Verify the file exists in the Spice process's filesystem (ls inside the container). |
Permission denied | Spice user lacks read on the file. | Adjust ACLs or mount with appropriate UID/GID. |
Model fails to load with unsupported architecture | Loader cannot infer architecture from filename. | Set model_type explicitly. |
| OOM on load | File size exceeds device memory. | Use a smaller / more quantized variant; move to CPU with more RAM; split across GPUs if supported. |
| Inference falls back to CPU unexpectedly | CUDA / Metal not available. | Use a CUDA-enabled Spice build on GPU hosts; for macOS, use the Apple Silicon build. |
| Slow first inference after startup | JIT / weight-quantization warmup. | Issue a warmup request at startup; subsequent calls are hot. |