What is LLM Inference?
LLM inference is the process of running input through a trained large language model to generate a response -- the forward pass that transforms a prompt into tokens, predictions, and output text.
Training a large language model and running inference on it are two fundamentally different operations. Training adjusts billions of parameters over weeks or months using massive datasets and GPU clusters. Inference uses those fixed parameters to generate output from a single input -- and it needs to happen in milliseconds.
For most developers, inference is the only interaction point with a model. Whether you are building a chatbot, a code assistant, a search pipeline, or an autonomous agent, the quality and speed of inference determines the user experience. Understanding how inference works -- and what levers you have to optimize it -- is essential for building production AI systems.
How LLM Inference Works
Inference is the forward pass through a trained neural network. For large language models, this process has four stages: tokenization, the forward pass, sampling, and detokenization.
Tokenization
The model does not process raw text. Before inference begins, the input prompt is converted into a sequence of tokens -- integer IDs that map to subword units in the model's vocabulary. A tokenizer splits text into these units based on a learned vocabulary (typically 32,000 to 128,000 tokens).
For example, the sentence "SQL federation queries multiple databases" might be tokenized into ["SQL", " feder", "ation", " queries", " multiple", " databases"], where each piece maps to an integer ID. The model operates entirely on these integer sequences.
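The mapping can be sketched in a few lines. The vocabulary and integer IDs below are invented for illustration; real tokenizers (BPE, SentencePiece) learn vocabularies of 32,000 to 128,000 pieces from data.

```python
# Toy tokenizer: map subword pieces to integer IDs and back.
# Vocabulary and IDs are invented for illustration only.
vocab = {"SQL": 311, " feder": 8402, "ation": 507,
         " queries": 1921, " multiple": 2740, " databases": 6133}
inverse_vocab = {tid: piece for piece, tid in vocab.items()}

def tokenize(pieces):
    """Convert a list of subword pieces into integer token IDs."""
    return [vocab[p] for p in pieces]

def detokenize(token_ids):
    """Map token IDs back to text by concatenating their pieces."""
    return "".join(inverse_vocab[t] for t in token_ids)

ids = tokenize(["SQL", " feder", "ation", " queries", " multiple", " databases"])
print(ids)              # the integer sequence the model actually sees
print(detokenize(ids))  # round-trips to the original sentence
```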
The Forward Pass
The token IDs are converted into dense vector embeddings and passed through the model's transformer layers. Each layer applies self-attention (computing relationships between all tokens in the sequence) and feed-forward transformations. For a model like Llama 3 70B, this means passing through 80 transformer layers with 64 attention heads each.
The output of the final layer is a probability distribution over the entire vocabulary for the next token. This is the core computation: given a sequence of tokens, predict the probability of every possible next token.
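Concretely, the final layer emits one raw score (logit) per vocabulary entry, and a softmax turns those scores into probabilities. The five-entry "vocabulary" and logit values below are invented for illustration:

```python
import math

# Softmax converts the final layer's logits into a probability
# distribution over possible next tokens. Logits are invented.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]

def softmax(xs):
    m = max(xs)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-9       # a valid probability distribution
next_token = probs.index(max(probs))      # greedy choice: highest-probability ID
```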
Sampling
The raw probability distribution is processed by a sampling strategy to select the next token. Common strategies include:
- Greedy decoding: Always select the highest-probability token. Deterministic but can produce repetitive output.
- Temperature sampling: Scale the probability distribution by a temperature parameter. Lower temperatures (e.g., 0.2) make the distribution sharper, favoring high-probability tokens. Higher temperatures (e.g., 1.0) flatten the distribution, increasing diversity.
- Top-k sampling: Restrict selection to the k most probable tokens, then sample from that subset.
- Top-p (nucleus) sampling: Restrict selection to the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9), then sample.
The choice of sampling strategy affects output quality, creativity, and consistency. For structured tasks like code generation or SQL queries, low temperature with top-p sampling typically produces the best results. For creative writing or brainstorming, higher temperatures introduce useful variation.
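The four strategies above can be sketched as small functions over a logit vector. This is a minimal illustration, not a production sampler:

```python
import math, random

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    """Greedy decoding: always pick the highest-probability token."""
    return logits.index(max(logits))

def sample_temperature(logits, temperature=1.0, rng=random):
    """Scale logits by 1/temperature before softmax: low T sharpens
    the distribution, high T flattens it."""
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_top_k(logits, k=2, rng=random):
    """Top-k: keep only the k most probable tokens, then sample."""
    ranked = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    probs = softmax([logits[i] for i in ranked])
    return rng.choices(ranked, weights=probs, k=1)[0]

def sample_top_p(logits, p=0.9, rng=random):
    """Top-p (nucleus): keep the smallest set of tokens whose
    cumulative probability exceeds p, then sample from that set."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Note that temperature, top-k, and top-p are commonly combined: temperature reshapes the distribution first, then top-k or top-p restricts the candidate set.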
Detokenization
The selected token ID is mapped back to its text representation using the tokenizer's vocabulary. This token is appended to the output and -- critically -- fed back into the model as part of the input for predicting the next token. This autoregressive loop continues until the model generates a stop token or reaches a maximum length.
This sequential, token-by-token generation is why LLM inference is inherently slower than a simple database query. Each new token requires a forward pass through the entire model.
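The autoregressive loop reduces to a few lines. Here `toy_model` is an invented stand-in for the full forward pass: it just emits a scripted continuation and then a stop token.

```python
# Minimal sketch of the autoregressive decode loop.
STOP_TOKEN = -1

def toy_model(token_ids):
    """Pretend forward pass: returns the 'most likely' next token ID.
    The scripted continuation is invented for illustration."""
    script = [101, 102, 103, STOP_TOKEN]
    generated = len(token_ids) - 2        # prompt below has 2 tokens
    return script[min(generated, len(script) - 1)]

def generate(prompt_ids, max_new_tokens=16):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = toy_model(tokens)       # one full forward pass per token
        if next_id == STOP_TOKEN:         # stop token ends generation
            break
        tokens.append(next_id)            # fed back in as input for the next step
    return tokens

out = generate([7, 8])
```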
Inference Performance Metrics
Four metrics define the performance profile of an LLM inference system:
Latency
Total time from request to complete response. For a chat application, this is how long the user waits. Latency is time-to-first-token plus the time to generate all subsequent tokens.
Time to First Token (TTFT)
The time between receiving a request and producing the first output token. TTFT is dominated by the prefill phase -- processing the entire input prompt through the model in a single forward pass. Longer prompts mean longer TTFT because the model must compute attention across all input tokens before generating any output.
For interactive applications, TTFT determines perceived responsiveness. A system with 200ms TTFT feels responsive even if total generation takes several seconds, because the user sees output beginning almost immediately.
Tokens Per Second (TPS)
The rate at which output tokens are generated after the first token. TPS measures the speed of the decode phase -- the autoregressive loop where each new token is generated one at a time. TPS is bounded by memory bandwidth rather than compute, because each decode step reads the full model weights from memory to generate a single token.
Throughput
The total number of tokens a system can generate per second across all concurrent requests. A system with 50 TPS per request serving 20 concurrent users has an aggregate throughput of 1,000 tokens per second. Throughput determines cost-efficiency: higher throughput means more work done per GPU-hour.
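The relationships between these metrics are simple arithmetic. The numbers below are illustrative, not measurements:

```python
# Back-of-the-envelope latency and throughput using the definitions above.
ttft_s = 0.2             # time to first token (prefill), seconds
tps = 50                 # decode speed per request, tokens/second
output_tokens = 200

# Total latency = TTFT + time to generate the remaining tokens.
latency_s = ttft_s + (output_tokens - 1) / tps
print(f"latency: {latency_s:.2f} s")          # 0.2 + 199/50 = 4.18 s

# Aggregate throughput across concurrent requests.
concurrent_requests = 20
throughput = tps * concurrent_requests
print(f"throughput: {throughput} tokens/s")   # 1,000 tokens/s
```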
Inference vs. Training
Training and inference use the same model architecture but differ in nearly every operational dimension:
- Direction: Training computes forward and backward passes (backpropagation) to update weights. Inference computes only the forward pass with fixed weights.
- Compute profile: Training is compute-bound -- dominated by matrix multiplications across large batches. Inference (especially the decode phase) is memory-bandwidth-bound -- each token generation reads the full model weights but performs relatively little computation.
- Hardware: Training requires clusters of high-end GPUs with fast interconnects (NVLink, InfiniBand). Inference can run on a single GPU, a CPU, or even edge devices depending on the model size and latency requirements.
- Batching: Training uses large batch sizes (thousands of samples) for efficiency. Inference batches are constrained by latency requirements -- larger batches improve throughput but increase per-request latency.
Inference Optimization Techniques
Several techniques reduce the cost and latency of LLM inference without significantly affecting output quality.
KV Cache
During autoregressive generation, the model recomputes attention over all previous tokens at each step. The key-value (KV) cache stores the intermediate key and value tensors from previous tokens so they don't need to be recomputed. This turns each decode step from O(n^2) attention to O(n), dramatically reducing computation for long sequences.
The tradeoff is memory. For a 70B parameter model with a 4,096-token context, the KV cache can consume several gigabytes of GPU memory. Managing KV cache memory is one of the primary challenges in serving long-context models.
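The memory cost can be estimated directly: two tensors (K and V) per layer, each holding one vector per cached token, at 2 bytes per value in FP16. The sketch below uses the Llama 3 70B figures cited above (80 layers, head dimension 128); Llama 3 70B uses grouped-query attention, caching 8 KV heads rather than all 64 query heads.

```python
# Estimate KV cache size per request.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2 tensors (K and V) per layer, each [seq_len, kv_heads * head_dim]
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"{size / 1e9:.2f} GB per 4,096-token request")   # ~1.34 GB

# Without grouped-query attention (all 64 heads cached), the same
# request would need 8x more memory:
size_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
print(f"{size_mha / 1e9:.2f} GB")                       # ~10.74 GB
```

Multiply by the number of concurrent requests and the pressure on GPU memory becomes clear.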
Quantization
Quantization reduces the precision of model weights from 16-bit floating point to 8-bit integers (INT8) or 4-bit integers (INT4). This reduces memory usage by 2-4x and increases inference speed because lower-precision operations are faster and the model reads less data from memory.
# Model memory usage at different precisions
# Llama 3 70B parameters:
# FP16: ~140 GB (70B params x 2 bytes)
# INT8: ~70 GB (70B params x 1 byte)
# INT4: ~35 GB (70B params x 0.5 bytes)
Modern quantization methods (GPTQ, AWQ, GGUF) minimize quality loss by calibrating quantization ranges against representative data. In practice, INT8 quantization produces output nearly indistinguishable from FP16 for most tasks. INT4 introduces measurable quality degradation but enables running large models on consumer hardware.
Speculative Decoding
Speculative decoding uses a small, fast draft model to generate several candidate tokens quickly, then verifies them in a single forward pass through the large target model. If the draft tokens are accepted (because the large model assigns them high probability), multiple tokens are produced in the time it would take to generate one.
This technique works well when the draft model's predictions frequently align with the target model -- which is common for straightforward text. Speculative decoding can improve TPS by 2-3x without any quality loss, because rejected draft tokens are replaced with the target model's output.
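The draft-then-verify loop can be sketched with two invented stand-in "models". This is a simplified greedy variant: production implementations verify all draft positions in a single batched forward pass and use a probabilistic accept/reject rule that provably preserves the target model's output distribution.

```python
# Toy sketch of speculative decoding (greedy verification variant).
def draft_model(tokens):
    """Cheap model: guesses 'count up'. Invented for illustration."""
    return tokens[-1] + 1

def target_model(tokens):
    """Large model: also counts up, but caps values at 5. Invented."""
    return min(tokens[-1] + 1, 5)

def speculative_step(tokens, k=4):
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft))
    candidates = draft[len(tokens):]
    # 2. Verify: check each draft token against the target model's choice,
    #    keeping the longest accepted prefix (simulated position by position).
    accepted = []
    for tok in candidates:
        expected = target_model(tokens + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token accepted
        else:
            accepted.append(expected)  # replaced by the target's own token
            break
    return tokens + accepted

print(speculative_step([1], k=4))  # all 4 drafts accepted in one "pass"
print(speculative_step([5], k=4))  # first draft rejected, target's token used
```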
Continuous Batching
Traditional batching groups requests into fixed-size batches and processes them together. The problem: short requests finish early but wait for the longest request in the batch to complete, wasting GPU cycles.
Continuous batching (also called iteration-level batching) inserts new requests into the batch as soon as existing requests finish, keeping the GPU fully utilized. This improves throughput significantly -- frameworks like vLLM and TensorRT-LLM use continuous batching to serve 2-5x more concurrent requests on the same hardware.
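The scheduling difference can be simulated with a toy model that ignores prefill and treats each decode iteration as one token per active request:

```python
from collections import deque

# Toy simulation of continuous (iteration-level) batching: after every
# decode iteration, finished requests free their slots immediately and
# queued requests take their place.
def continuous_batching(request_lengths, batch_size):
    queue = deque(enumerate(request_lengths))   # (request_id, tokens_left)
    active, finished_order, iterations = {}, [], 0
    while queue or active:
        # Fill free slots from the queue at every iteration boundary.
        while queue and len(active) < batch_size:
            rid, tokens_left = queue.popleft()
            active[rid] = tokens_left
        # One decode iteration: every active request emits one token.
        iterations += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished_order.append(rid)
                del active[rid]                 # slot frees up immediately
    return iterations, finished_order

iters, order = continuous_batching([2, 8, 3], batch_size=2)
print(iters, order)
```

With these invented lengths, continuous batching finishes all three requests in 8 iterations; static batching would take 11 (the 2-token request's slot sits idle until the 8-token request completes, and the third request waits for a whole new batch).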
Local vs. Cloud Inference
Developers choosing where to run inference face a set of tradeoffs:
Cloud/API inference (OpenAI, Anthropic, Google) provides access to the largest models without managing infrastructure. The tradeoffs are per-token cost, network latency, data privacy constraints, and vendor dependency. For prototyping and applications where the largest models are necessary, cloud inference is the practical starting point.
Local inference runs models on your own hardware -- GPUs, CPUs, or edge devices. This eliminates per-token cost, removes network latency, and keeps data private. The tradeoffs are hardware investment, model size limitations (you need enough memory to fit the model), and operational overhead. Quantized open-source models (Llama, Mistral, Qwen) make local inference increasingly practical for production workloads.
Hybrid approaches route requests to local or cloud models based on task complexity, latency requirements, or cost budgets. Simple classification or extraction tasks go to a fast, small local model. Complex reasoning tasks go to a large cloud model. This pattern optimizes for both cost and quality.
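A hybrid router can be as simple as a lookup plus a latency check. The model names and the complexity heuristic below are invented; real routers may use classifiers, token counts, or cost budgets:

```python
# Minimal sketch of a hybrid local/cloud router. All names are hypothetical.
LOCAL_MODEL = "llama-3-8b-local"     # hypothetical local deployment
CLOUD_MODEL = "large-cloud-model"    # hypothetical cloud API model

SIMPLE_TASKS = {"classification", "extraction", "embedding"}

def route(task_type, max_latency_ms=None):
    """Send simple or latency-critical tasks to the local model and
    complex reasoning to the cloud model."""
    if task_type in SIMPLE_TASKS:
        return LOCAL_MODEL
    if max_latency_ms is not None and max_latency_ms < 500:
        return LOCAL_MODEL           # cloud round-trip would blow the budget
    return CLOUD_MODEL
```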
Inference for Embeddings vs. Generation
Not all LLM inference is text generation. Embedding inference runs input through a model to produce a dense vector representation rather than generating new tokens. Embedding models are used for semantic search, retrieval-augmented generation, clustering, and classification.
Embedding inference is fundamentally different from generative inference:
- Single forward pass: Embeddings are produced in one pass through the model. There is no autoregressive loop, no sampling, no token-by-token generation.
- Batch-friendly: Embedding requests can be batched aggressively because there is no sequential dependency between tokens.
- Latency profile: Embedding latency scales with input length but is typically 10-100x faster than generating the same number of tokens, because there is no decode phase.
Production systems often run embedding and generative models side by side. A search query generates an embedding (fast, single-pass inference), which retrieves relevant documents, which are then fed to a generative model for synthesis (slower, autoregressive inference).
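The retrieval step reduces to a nearest-neighbor search over those vectors. The 3-dimensional embeddings below are invented for illustration; real embedding models emit hundreds to thousands of dimensions per forward pass:

```python
import math

# Toy semantic search: rank documents by cosine similarity to a query
# embedding. All vectors are invented for illustration.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

documents = {
    "database tuning guide": [0.9, 0.1, 0.0],
    "pasta recipes":         [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]    # e.g. "how do I speed up my queries?"

best = max(documents, key=lambda d: cosine_similarity(query_embedding, documents[d]))
print(best)
```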
LLM Inference with Spice
Spice serves LLM inference alongside federated SQL queries, embedding search, and tool calling in a single runtime. This co-location means AI applications can:
- Query data and run inference in one request: Retrieve context from databases via SQL federation, generate embeddings for hybrid search, and produce a response -- all through a single endpoint.
- Route across models: Direct requests to local open-source models or cloud APIs based on task requirements, cost, and latency constraints.
- Combine inference with tool use: Models served through Spice can invoke tools via the MCP gateway to access live data, execute queries, and take actions as part of the inference loop.
- Observe everything: Distributed tracing across data queries, inference calls, and tool invocations provides full visibility into end-to-end AI workflows.
This unified approach eliminates the need to stitch together separate services for data access, model serving, and tool execution -- reducing operational complexity while improving latency through co-located processing.
Advanced Topics
The Inference Pipeline
A complete inference request passes through multiple stages, each with distinct performance characteristics and optimization opportunities.
The prefill phase processes all input tokens in parallel through the model's transformer layers, producing the KV cache and the first output token. The decode phase then generates tokens one at a time in an autoregressive loop, reading from and appending to the KV cache at each step. Prefill is compute-bound (matrix multiplications across the full input sequence), while decode is memory-bandwidth-bound (reading model weights for each single-token generation). Understanding this distinction is essential for choosing the right optimization strategy.
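One consequence of the decode phase being memory-bandwidth-bound: an upper bound on single-request decode speed falls out of simple division, since every step must stream the full weights from memory. The hardware numbers below are illustrative, not measurements of any specific GPU:

```python
# Bandwidth ceiling on decode tokens/second: each step reads all weights.
bandwidth_gb_s = 2000        # hypothetical GPU memory bandwidth, GB/s
model_size_gb = 140          # 70B parameters at FP16 (2 bytes each)

max_tps = bandwidth_gb_s / model_size_gb     # upper bound, single request
print(f"~{max_tps:.1f} tokens/s ceiling")

# Quantizing to INT4 shrinks the bytes read per step, raising the ceiling:
max_tps_int4 = bandwidth_gb_s / 35           # 70B params at 0.5 bytes each
print(f"~{max_tps_int4:.1f} tokens/s ceiling at INT4")
```

This is why quantization and batching (amortizing one weight read across many requests) matter so much more for decode than raw FLOPS.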
PagedAttention
The KV cache is the primary memory bottleneck in LLM serving. Traditional implementations pre-allocate a contiguous block of GPU memory for each request's KV cache based on the maximum possible sequence length. This leads to significant memory waste -- a request that generates 100 tokens still reserves memory for the full context window (e.g., 8,192 or 128,000 tokens).
PagedAttention, introduced by the vLLM project, applies virtual memory concepts from operating systems to KV cache management. Instead of allocating contiguous memory, it divides the KV cache into fixed-size blocks (pages) that are allocated on demand as new tokens are generated. Pages can be stored non-contiguously in GPU memory and mapped through a block table, similar to how a CPU's page table maps virtual addresses to physical memory.
The practical impact is substantial: PagedAttention reduces KV cache memory waste from 60-80% to near zero, enabling 2-4x more concurrent requests on the same GPU hardware. This directly translates to higher throughput and lower cost per token. PagedAttention also enables efficient memory sharing for techniques like parallel sampling and beam search, where multiple output sequences share the same input prefix.
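The bookkeeping can be sketched as a page allocator plus a block table. This toy models only the mapping, not the attention kernel, and the page size is an arbitrary illustrative choice:

```python
# Toy block-table allocator in the spirit of PagedAttention: KV entries
# are appended to fixed-size pages allocated on demand, and a per-request
# block table maps logical token positions to (page, offset).
PAGE_SIZE = 16               # tokens per KV-cache page (illustrative)

class PagedKVCache:
    def __init__(self):
        self.next_page = 0           # simulated physical page pool
        self.block_tables = {}       # request_id -> list of page ids
        self.lengths = {}            # request_id -> tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        n = self.lengths.get(request_id, 0)
        if n % PAGE_SIZE == 0:       # current page full: allocate a new one
            table.append(self.next_page)
            self.next_page += 1
        self.lengths[request_id] = n + 1

    def locate(self, request_id, position):
        """Map a logical token position to (physical page, offset)."""
        table = self.block_tables[request_id]
        return table[position // PAGE_SIZE], position % PAGE_SIZE

cache = PagedKVCache()
for _ in range(20):                  # a 20-token request occupies only 2 pages,
    cache.append_token("req-A")      # not a pre-allocated full context window
```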
Prefix Caching
Many inference workloads involve repeated prefixes. Chat applications prepend the same system prompt to every request. RAG systems share common instructions and formatting templates. API endpoints serving the same application reuse the same tool definitions and context structures.
Prefix caching stores the KV cache entries for common prefixes in GPU memory so they don't need to be recomputed for each request. When a new request arrives with a matching prefix, the system copies the cached KV entries (or references them via PagedAttention's block table) and only computes the prefill for the unique portion of the prompt.
For workloads where the shared prefix constitutes 50-90% of the input (common in production applications with long system prompts), prefix caching can reduce time-to-first-token by a corresponding 50-90%. This optimization is especially impactful for tool calling workloads where tool definitions are repeated across every request.
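The lookup logic amounts to finding the longest cached prefix of an incoming prompt. In this sketch `kv_state` is a string stand-in for real cached K/V tensors:

```python
# Toy prefix cache: store KV state keyed by token prefix, and on a new
# request reuse the longest cached prefix so prefill only covers the
# unique suffix.
class PrefixCache:
    def __init__(self):
        self.cache = {}      # tuple(prefix token ids) -> kv_state

    def store(self, prompt_ids, kv_state):
        self.cache[tuple(prompt_ids)] = kv_state

    def longest_prefix(self, prompt_ids):
        """Return (cached kv_state, tokens skipped) for the longest
        cached prefix of this prompt, or (None, 0) on a miss."""
        for end in range(len(prompt_ids), 0, -1):
            key = tuple(prompt_ids[:end])
            if key in self.cache:
                return self.cache[key], end
        return None, 0

system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]     # shared by every request
cache = PrefixCache()
cache.store(system_prompt, kv_state="kv-for-system-prompt")

request = system_prompt + [42, 43]           # unique user turn
state, skipped = cache.longest_prefix(request)
# Prefill now covers only len(request) - skipped = 2 tokens instead of 10.
```

Production systems implement this on top of PagedAttention-style block tables, so matching prefix pages are shared by reference rather than copied.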
Inference Serving Architectures
Production inference serving systems must balance throughput, latency, and cost across diverse workload patterns. Two architectural approaches have emerged.
Model-parallel serving distributes a single large model across multiple GPUs using tensor parallelism (splitting layers across GPUs) or pipeline parallelism (assigning different layers to different GPUs). Tensor parallelism reduces per-token latency by parallelizing the computation within each layer, while pipeline parallelism increases throughput by processing different requests at different pipeline stages simultaneously.
Disaggregated serving separates the prefill and decode phases onto different hardware. Prefill is compute-bound and benefits from high-FLOPS GPUs, while decode is memory-bandwidth-bound and benefits from GPUs with high memory bandwidth. By routing prefill and decode to hardware optimized for each phase, disaggregated architectures can improve overall cost-efficiency by 30-50% compared to running both phases on the same hardware. This pattern is gaining adoption in large-scale serving systems where the workload justifies the additional routing complexity.
LLM Inference FAQ
What is the difference between LLM inference and training?
Training adjusts a model's parameters by computing forward and backward passes over large datasets, typically requiring GPU clusters and running for weeks. Inference uses the trained, fixed parameters to generate output from a single input in milliseconds to seconds. Training is compute-bound; inference (particularly the decode phase) is memory-bandwidth-bound.
What is the difference between TTFT and TPS?
Time to First Token (TTFT) measures how long it takes to produce the first output token, dominated by processing the input prompt (the prefill phase). Tokens Per Second (TPS) measures how fast subsequent tokens are generated during the decode phase. A system can have low TTFT but moderate TPS, or vice versa -- they are independent performance dimensions influenced by different bottlenecks.
Does quantization significantly reduce output quality?
INT8 quantization produces output that is nearly indistinguishable from full-precision (FP16) inference for most tasks, with minimal quality degradation. INT4 quantization introduces measurable quality loss, particularly on reasoning-heavy tasks, but enables running large models on significantly less hardware. Modern quantization methods (GPTQ, AWQ) minimize this loss by calibrating against representative data.
When should I use local inference vs. a cloud API?
Use cloud APIs when you need the largest models, want to avoid infrastructure management, or are prototyping. Use local inference when per-token cost, data privacy, or network latency are primary concerns. Many production systems use a hybrid approach: routing simple tasks to fast local models and complex tasks to large cloud models based on cost and quality requirements.
How does embedding inference differ from text generation?
Embedding inference produces a fixed-size vector representation of input text in a single forward pass -- no autoregressive token generation, no sampling. This makes it significantly faster and more batch-friendly than generative inference. Embedding inference is used for semantic search, retrieval-augmented generation (RAG), classification, and clustering, while generative inference produces free-form text output.
Learn more about LLM inference
Documentation and blog posts on serving LLM inference with Spice.
Spice AI Docs
Learn how to serve LLM inference alongside federated SQL queries, embeddings, and tool calling in a single Spice runtime.

A Developer's Guide to Understanding Spice.ai
Learn what Spice.ai is, when to use it, and how it solves enterprise data challenges. A developer-focused guide to federation, acceleration, search, and AI.

The Spice.ai for GitHub Copilot Extension is now available!
With the Spice.ai Extension, developers can interact with data from any external data source directly within GitHub Copilot.
