What is Vortex?
Vortex is an open-source compressed columnar file format designed for analytical queries. Developed by Spiral -- the open-source team behind Spice -- Vortex uses adaptive encoding to deliver faster scans and better compression than traditional columnar formats.
Analytical workloads -- dashboards, AI pipelines, federated queries -- depend on fast, efficient reads over large columnar datasets. Apache Parquet has been the standard columnar file format for over a decade, but its design predates many of the techniques that modern hardware and query engines can exploit: memory-mapped I/O, zero-copy reads, and adaptive per-column encoding.
Vortex is a new open-source columnar file format built from scratch to take advantage of these capabilities. It is designed specifically for the demands of analytical query engines that need to scan, filter, and aggregate data as fast as possible.
How Vortex Works
At its core, Vortex stores data in a columnar layout -- each column is stored independently, so a query that only needs three columns out of fifty reads only those three. This is the same fundamental principle behind Parquet, ORC, and other columnar formats. Where Vortex diverges is in how it encodes and compresses the data within each column.
Adaptive Encoding
Traditional columnar formats apply a single encoding scheme per column (or per row group). Parquet, for example, uses dictionary encoding for low-cardinality columns and falls back to plain encoding otherwise. The encoding is chosen at write time and remains fixed.
Vortex takes a different approach: it uses a cascading encoding system that adapts to the actual data distribution within each column segment. Rather than selecting a single encoding, Vortex can layer multiple encodings on top of each other:
- Dictionary encoding for columns with repeated values
- Run-length encoding (RLE) for columns with consecutive repeated values
- Frame-of-reference (FOR) encoding for columns with values clustered around a base
- Bit-packing for integer columns that don't use the full bit width
- Delta encoding for monotonically increasing sequences like timestamps
- Constant encoding for segments where every value is the same
The encoding selection happens per column segment, not per column. A single column can use different encodings for different portions of the data, depending on the local distribution. As a result, Vortex typically achieves better compression ratios than formats that apply a single encoding globally.
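To make per-segment selection concrete, here is a minimal, illustrative sketch (not Vortex's actual API or heuristics) of choosing an encoding from local statistics. The thresholds are invented for illustration; the point is that two segments of the same column can legitimately get different encodings.

```python
# Illustrative sketch (not Vortex's actual API): choosing an encoding
# per segment based on the local data distribution.

def choose_encoding(segment):
    """Pick an encoding for one segment of a column."""
    distinct = set(segment)
    if len(distinct) == 1:
        return "constant"
    # Count runs of consecutive identical values.
    runs = 1 + sum(1 for a, b in zip(segment, segment[1:]) if a != b)
    if runs <= len(segment) // 4:           # long runs dominate
        return "rle"
    if len(distinct) <= len(segment) // 8:  # low cardinality
        return "dictionary"
    if all(b >= a for a, b in zip(segment, segment[1:])):
        return "delta"                      # monotonically increasing
    return "bitpack"                        # fallback for narrow integers

column = [7] * 1024 + list(range(1000, 2024))  # two 1024-value segments
segments = [column[i:i + 1024] for i in range(0, len(column), 1024)]
print([choose_encoding(s) for s in segments])  # → ['constant', 'delta']
```

The same column yields a constant-encoded segment followed by a delta-encoded one, exactly the per-segment adaptivity described above.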
Zero-Copy Reads and Memory Mapping
Vortex is designed for zero-copy reads from memory-mapped files. When a query engine accesses a Vortex file, it can memory-map the file and read encoded data directly without first decompressing the entire column into a separate buffer. The encodings are designed so that common operations -- scanning, filtering, aggregation -- can operate directly on the encoded representation.
This is a significant architectural difference from Parquet, where data must be fully decompressed and decoded before a query engine can process it. With Vortex, decompression is lazy: only the data actually needed by the query is decoded, and only at the point of use.
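The zero-copy access pattern can be demonstrated with nothing more than the standard library. This sketch writes a plain array of 64-bit integers as a stand-in for one column, memory-maps it, and reads values through a cast memoryview without allocating a decompressed staging buffer (file layout and names are invented for illustration):

```python
import mmap, os, struct, tempfile

# Write a small binary file of 64-bit little-endian integers to stand
# in for one memory-mappable encoded column.
values = list(range(1_000))
path = os.path.join(tempfile.mkdtemp(), "col.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}q", *values))

# Memory-map the file and view it as integers without copying: the OS
# pages data in on demand, and the cast memoryview reads straight out
# of the page cache rather than a separate decode buffer.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    col = memoryview(mm).cast("q")          # zero-copy 64-bit integer view
    x = col[123]                            # touches one page, not the file
    total = sum(v for v in col if v > 990)  # scan directly over the mapping
    col.release()
    mm.close()

print(x, total)  # → 123 8955
```

Real Vortex readers operate on encoded (not plain) data, but the memory-mapping mechanics are the same: the working set is whatever pages the query actually touches.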
Random Access Without Full Decompression
Parquet supports predicate pushdown through row group statistics (min/max values), but once a row group is selected, the entire column chunk must be decompressed to access individual values. Vortex supports fine-grained random access within encoded segments. A query that needs a single value from a column can locate and decode just that value without decompressing the surrounding data.
This property is particularly valuable for point lookups, late materialization, and any query pattern where only a small fraction of the data in a column is actually needed.
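One way to see how random access inside an encoded segment can work: for a run-length-encoded segment, a point lookup only needs a binary search over the cumulative run lengths, never an expansion of the runs. This is an illustrative sketch, not Vortex's internal layout:

```python
from bisect import bisect_right
from itertools import accumulate

# Illustrative sketch: random access into a run-length-encoded segment.
# Instead of expanding every run, binary-search the cumulative run ends
# to find the run containing row i, then return that run's value.

runs = [("a", 500), ("b", 3), ("a", 1200), ("c", 297)]  # (value, length)
ends = list(accumulate(length for _, length in runs))   # [500, 503, 1703, 2000]

def rle_lookup(i):
    return runs[bisect_right(ends, i)][0]

print(rle_lookup(0), rle_lookup(500), rle_lookup(1999))  # → a b c
```

The lookup is O(log runs) regardless of how many rows the segment holds, which is what makes point lookups cheap on encoded data.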
Vortex vs. Apache Parquet
Both Vortex and Parquet are columnar file formats, and they share the same goal: efficient storage and retrieval of analytical data. The differences are in execution.
Encoding flexibility: Parquet uses a fixed set of encodings chosen at write time. Vortex uses adaptive, cascading encodings that vary per column segment based on data distribution. This gives Vortex consistently better compression ratios across diverse data types.
Decompression model: Parquet requires full decompression of column chunks before processing. Vortex supports lazy decompression and can operate on encoded data directly, reducing memory usage and improving scan performance.
Random access: Parquet's smallest addressable unit is a column chunk within a row group. Vortex supports finer-grained access within encoded segments, enabling efficient point lookups and late materialization.
Memory mapping: Vortex is designed for zero-copy reads from memory-mapped files. Parquet was not designed with memory mapping as a primary access pattern, though some implementations (like DuckDB's Parquet reader) add this capability at the reader level.
Ecosystem maturity: Parquet has broad ecosystem support -- virtually every data tool can read and write Parquet. Vortex is newer and currently used primarily within the Spice ecosystem. For interchange between systems, Parquet remains the standard. For performance-critical acceleration workloads, Vortex offers measurable advantages.
Vortex vs. Other Columnar Formats
Vortex vs. Lance
Lance is a columnar format designed for machine learning workloads, with a focus on versioned datasets and fast vector search. Vortex is designed for general analytical query performance with an emphasis on scan speed and compression efficiency. Lance optimizes for ML-specific access patterns (random row access, version management); Vortex optimizes for the scan-filter-aggregate patterns common in SQL analytics.
Vortex vs. Apache ORC
ORC (Optimized Row Columnar) is the Hive ecosystem's columnar format. Like Parquet, ORC uses fixed encodings chosen at write time. Vortex's adaptive encoding system and lazy decompression give it performance advantages for scan-heavy workloads. ORC is tightly integrated with the Hadoop ecosystem; Vortex is designed for modern, Rust-native query engines.
Vortex vs. Apache Arrow IPC
Arrow IPC is an in-memory serialization format for Apache Arrow arrays. It is designed for zero-copy data exchange between processes, not for persistent storage with compression. Vortex is a storage format that achieves high compression while preserving the ability to operate on encoded data. They serve different purposes: Arrow IPC for inter-process communication, Vortex for on-disk analytical storage.
Performance Characteristics
Vortex's design yields several measurable performance benefits:
- Faster scan times: Lazy decompression means the query engine avoids decoding data that is filtered out early. For selective queries (those that touch a small fraction of rows), this translates to significantly faster scans compared to formats that require full decompression.
- Better compression ratios: Adaptive encoding that varies per column segment consistently achieves smaller file sizes than fixed-encoding formats on the same data. Smaller files mean less I/O, which compounds the scan speed improvement.
- Lower memory usage: Zero-copy reads from memory-mapped files eliminate the need to allocate separate buffers for decompressed data. The working memory footprint of a Vortex-backed query is proportional to the data actually accessed, not the total column size.
- Efficient point lookups: Random access within encoded segments enables efficient lookups without scanning or decompressing surrounding data.
These characteristics are most impactful in acceleration workloads -- scenarios where data is cached locally for fast, repeated access by SQL federation queries, dashboards, or AI pipelines.
How Spice Uses Vortex
Spice uses Vortex as the storage format for its Cayenne data accelerator. When data is accelerated in Spice -- cached locally from remote sources like PostgreSQL, Databricks, or Amazon S3 -- it is stored in Vortex format on disk or in memory.
This means that federated queries that hit the acceleration layer benefit from Vortex's lazy decompression, adaptive encoding, and zero-copy reads. The result is sub-second query performance over locally cached data, even for datasets that would be too large to hold fully decompressed in memory.
The acceleration layer is kept synchronized with source systems using change data capture, so the Vortex-encoded local cache always reflects the current state of the source data.
Cayenne and Vortex
Cayenne is Spice's next-generation data accelerator, purpose-built for high-scale analytical workloads. It uses Vortex as its underlying storage format and adds:
- Incremental updates: When source data changes, only the affected segments are re-encoded. The entire dataset does not need to be rewritten.
- Tiered storage: Hot data is memory-mapped for zero-copy access. Warm data is stored on local disk. The tiering is transparent to the query engine.
- Integration with Apache DataFusion: Cayenne exposes Vortex-encoded data as DataFusion table providers, so the query engine can push filters and projections directly into the storage layer.
When to Use Vortex
Vortex is the right choice when:
- Query performance is the priority: If you need the fastest possible scan, filter, and aggregate performance over columnar data, Vortex's adaptive encoding and lazy decompression provide measurable improvements over Parquet.
- Data is cached locally for acceleration: Vortex is designed for the acceleration use case -- caching remote data locally for fast repeated access.
- Memory efficiency matters: Zero-copy reads and lazy decompression reduce the memory footprint of analytical workloads.
Parquet remains the better choice for data interchange between systems, archival storage in data lakes, and any scenario where broad ecosystem compatibility is more important than raw scan performance.
Advanced Topics
Encoding Selection Algorithms
Vortex does not rely on manual encoding hints or fixed heuristics. Instead, it uses a cost-based encoding selection algorithm that evaluates each column segment against the available encoding schemes and selects the combination that minimizes a weighted objective of compressed size and expected decode cost.
The algorithm works in two phases. First, it profiles a column segment to compute statistics: cardinality, run lengths, value range, null density, and sort order. Second, it evaluates each candidate encoding against these statistics. Dictionary encoding is favored when cardinality is low relative to segment length. Run-length encoding is favored when there are long runs of consecutive identical values. Frame-of-reference encoding is favored when values fall within a narrow range. Bit-packing is favored when the effective bit width is significantly smaller than the storage type's bit width. Delta encoding is favored for monotonically increasing or decreasing sequences.
The cost model accounts for both storage efficiency (bytes per value after encoding) and query performance (estimated CPU cycles to decode a value). This trade-off matters because a highly compressed encoding that is expensive to decode may be slower in practice than a moderately compressed encoding that supports fast scans. The algorithm selects the encoding that optimizes for the expected query workload, which is scan-heavy by default.
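The two-phase shape of such an algorithm (profile, then score candidates against a weighted objective) can be sketched as follows. The candidate set, size estimates, and cost weights here are invented for illustration and are not Vortex's actual cost model:

```python
# Illustrative sketch of cost-based encoding selection: score each
# candidate by a weighted sum of encoded size and estimated decode
# cost, then keep the cheapest.

def profile(seg):
    runs = 1 + sum(1 for a, b in zip(seg, seg[1:]) if a != b)
    width = max(max(seg).bit_length(), 1)
    return {"n": len(seg), "card": len(set(seg)), "runs": runs, "width": width}

def candidates(p):
    # (name, estimated encoded bytes, relative decode cost per value)
    yield ("plain",   p["n"] * 8,                1.0)
    yield ("bitpack", p["n"] * p["width"] / 8,   1.5)
    yield ("rle",     p["runs"] * 9,             1.2)
    yield ("dict",    p["card"] * 8
                      + p["n"] * max(p["card"].bit_length(), 1) / 8, 2.0)

def select(seg, size_weight=1.0, cpu_weight=4.0):
    p = profile(seg)
    return min(candidates(p),
               key=lambda c: size_weight * c[1] + cpu_weight * c[2] * p["n"])[0]

print(select([5] * 4000 + [9] * 4000))  # long runs win → rle
print(select(list(range(100_000))))     # dense distinct ints → bitpack
```

Raising `cpu_weight` biases the choice toward cheap-to-decode encodings, which is the scan-heavy default described above.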
Cascading Encodings
One of Vortex's distinguishing features is its support for cascading (layered) encodings. Rather than choosing a single encoding per segment, Vortex can stack multiple encodings in sequence. For example, a timestamp column with mostly increasing values might first be delta-encoded (converting absolute timestamps to small deltas), and then the resulting delta values might be bit-packed (since the deltas require fewer bits than the original timestamps).
Cascading works because each encoding transforms data into a representation that may be more amenable to further encoding. The encoding selection algorithm evaluates multi-layer combinations, not just individual encodings. It uses a bounded search to avoid exponential blowup -- typically evaluating up to two or three layers, since additional layers rarely yield significant benefit.
This approach lets Vortex achieve compression ratios that no single encoding can match. In practice, cascading is most effective on numeric columns with structured patterns -- timestamps, auto-incrementing IDs, sensor readings with bounded variation -- where the first encoding removes most of the entropy and the second encoding compresses the residual.
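The timestamp example above can be worked through with a quick back-of-the-envelope sketch: delta-encode, then bit-pack the residuals. Sizes are raw bit counts, ignoring per-segment metadata, and the data is synthetic:

```python
import random

# Illustrative sketch of cascading: delta-encode a monotonically
# increasing timestamp column, then bit-pack the small residual deltas.
random.seed(42)
ts = [1_700_000_000_000]
for _ in range(9_999):
    ts.append(ts[-1] + random.randint(1, 1000))  # millisecond timestamps

# Layer 1: delta encoding turns large absolutes into small deltas.
deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]

# Layer 2: bit-pack the deltas (first value kept at full 64-bit width).
width = max(d.bit_length() for d in deltas[1:])
raw_bits = 64 * len(ts)
packed_bits = 64 + width * (len(ts) - 1)
print(f"raw: {raw_bits} bits, delta+bitpack: {packed_bits} bits")
```

Neither layer alone achieves this: bit-packing raw 41-bit timestamps saves little, and deltas stored at full width save nothing. The first layer removes the entropy; the second compresses the residual.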
Lazy Decompression and Pushdown
Vortex's lazy decompression model is more than a performance optimization -- it changes which operations are possible at the storage layer. Because encoded data retains enough structure for certain operations, Vortex supports compute pushdown into the encoding layer itself.
For example, a filter predicate like WHERE timestamp > '2026-01-01' on a frame-of-reference-encoded column can be evaluated without fully decoding the column. The storage layer shifts the predicate threshold into the encoded domain (by subtracting the segment's base value) and compares it against the stored offsets directly. Only segments that pass the filter are decoded to Arrow arrays for further processing.
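Frame-of-reference encoding gives the simplest case of this kind of pushdown: the threshold is shifted into the encoded domain once per segment, and the comparison runs against the stored offsets with no decoding. A minimal sketch, with invented values:

```python
# Illustrative sketch: evaluating "value > threshold" against a
# frame-of-reference-encoded segment without decoding it. Values are
# stored as small offsets from a per-segment base.

base = 1_000_000
offsets = [3, 17, 250, 999, 4, 812]  # encoded: value = base + offset

def filter_encoded(offsets, base, threshold):
    shifted = threshold - base       # one subtraction per segment
    return [i for i, off in enumerate(offsets) if off > shifted]

print(filter_encoded(offsets, base, 1_000_500))  # → [3, 5]
```

The predicate is evaluated entirely in the encoded domain; only the rows at the returned indices would ever need to be materialized.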
Similarly, min/max statistics and null counts are maintained per segment in the Vortex file metadata. The query engine uses these statistics for segment pruning -- skipping entire segments that cannot contain matching rows -- before any data is read from disk.
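Segment pruning with min/max statistics reduces to an interval-overlap test per segment. A short illustrative sketch (segment boundaries and statistics invented):

```python
# Illustrative sketch of segment pruning: per-segment (min, max)
# statistics let the reader skip segments that cannot contain a
# matching row, before any data is read from disk.

segments = [
    {"rows": range(0, 1000),    "min": 10,  "max": 95},
    {"rows": range(1000, 2000), "min": 96,  "max": 340},
    {"rows": range(2000, 3000), "min": 341, "max": 700},
]

def prune(segments, lo, hi):
    """Keep only segments whose [min, max] range overlaps [lo, hi]."""
    return [s for s in segments if s["max"] >= lo and s["min"] <= hi]

survivors = prune(segments, lo=100, hi=350)
print(len(survivors))  # the first segment (max 95) is skipped entirely
```

A predicate whose range overlaps no segment at all prunes the whole column with zero I/O.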
This pushdown capability is exposed to Apache DataFusion through the TableProvider interface. When DataFusion pushes filters and projections down to a Vortex table provider, the provider evaluates them at the encoding level, reads and decodes only the qualifying segments, and returns the results as Arrow record batches. The query engine never sees or processes data that was pruned at the storage layer.
Segment Layout and Metadata
Vortex files are organized into segments, each containing a contiguous range of rows for a single column. Segment boundaries are chosen based on a target size (typically 64 KB to 1 MB of encoded data), balancing fine-grained pruning against metadata overhead. Each segment stores its encoding type, compressed data, null bitmap, and lightweight statistics (row count, null count, min, max).
The file footer contains a segment index that maps row ranges to segment offsets. This index enables efficient range scans and point lookups: the query engine binary-searches the index to find the relevant segments, reads only those segments from disk, and decodes them. For memory-mapped files, this translates to a small number of page faults rather than a sequential scan of the entire column.
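The binary search over the segment index can be sketched in a few lines. The index entries here (starting row, byte offset) are invented; a real footer carries more metadata per entry:

```python
from bisect import bisect_right

# Illustrative sketch of a footer segment index: each entry maps a
# segment's starting row to its byte offset in the file. A point
# lookup binary-searches the starting rows for the target row.

index = [(0, 0), (4096, 52_113), (8192, 97_440), (12288, 141_902)]
start_rows = [start for start, _ in index]

def segment_for_row(row):
    return index[bisect_right(start_rows, row) - 1]

start, byte_offset = segment_for_row(9000)
print(start, byte_offset)  # row 9000 lives in the segment starting at 8192
```

Only that one segment is then read and decoded; with a memory-mapped file, this is a handful of page faults rather than a sequential scan of the column.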
Vortex FAQ
How does Vortex compare to Apache Parquet?
Both are columnar file formats for analytical data, but they differ in encoding and decompression. Parquet uses fixed encodings chosen at write time and requires full decompression before processing. Vortex uses adaptive, cascading encodings that vary per column segment and supports lazy decompression -- only the data actually needed by a query is decoded. This gives Vortex faster scan times and better compression ratios for most workloads, while Parquet has broader ecosystem support.
Is Vortex open source?
Yes. Vortex is developed by Spiral, the open-source team behind Spice, and is available under an open-source license. The source code, documentation, and issue tracker are publicly accessible.
When should I use Vortex instead of Parquet?
Use Vortex when query performance and memory efficiency are priorities -- particularly for data acceleration workloads where data is cached locally for fast, repeated access. Use Parquet when you need broad ecosystem compatibility or are storing data in a data lake for interchange between multiple tools.
Can Vortex be used with existing data tools?
Vortex is currently used primarily within the Spice ecosystem, where it powers the Cayenne data accelerator. It is not a drop-in replacement for Parquet in arbitrary data pipelines. For data interchange, Parquet remains the standard. Within Spice, Vortex is used automatically when data acceleration is enabled.
What is the relationship between Vortex and Apache Arrow?
Vortex is designed to work with Apache Arrow. When Vortex data is decoded, it produces Arrow arrays that can be processed by any Arrow-compatible query engine. Vortex also supports operating on encoded data directly (without decoding to Arrow) for operations like filtering and aggregation, which is where its performance advantages come from. Arrow IPC is an in-memory interchange format; Vortex is a compressed on-disk storage format.
Learn more about Vortex and data acceleration
Guides and blog posts on columnar storage, data acceleration, and query performance with Spice:
- Spice.ai OSS Documentation -- how Spice uses Vortex to accelerate queries with adaptive encoding, lazy decompression, and zero-copy reads.
- Introducing Spice Cayenne: The Next-Generation Data Accelerator Built on Vortex for Performance and Scale -- the next-generation Spice.ai data accelerator built for high-scale, low-latency data lake workloads.
- How we use Apache DataFusion at Spice AI -- a technical overview of how Spice extends Apache DataFusion with custom table providers, optimizer rules, and UDFs.