What is Apache Iceberg?
Apache Iceberg is an open table format designed for large analytic datasets. Originally developed at Netflix to address the limitations of Hive tables, Iceberg provides reliable, performant table semantics on top of data lakes -- enabling schema evolution, hidden partitioning, time travel, and ACID transactions at petabyte scale.
Data lakes store vast amounts of data in open file formats like Parquet and ORC on object storage systems such as Amazon S3, Google Cloud Storage, and Azure Blob Storage. But file formats alone do not define a table. Without a table format layer, there is no reliable way to track which files belong to a table, handle concurrent writes, evolve schemas, or query consistent snapshots of the data.
Apache Iceberg fills this gap. It sits between the query engine and the underlying data files, providing the metadata management, consistency guarantees, and optimization capabilities that turn a collection of files into a proper table. Iceberg is engine-agnostic -- it works with Spark, Trino, Flink, Dremio, Snowflake, and other query engines, including federated SQL engines like Spice.
How Table Formats Work
A table format defines the contract between a query engine and the data stored on disk. It answers fundamental questions: which files make up the current version of a table, what the schema is, how the data is partitioned, and how to read a consistent snapshot while writes are in progress.
Without a table format, a query engine scanning a directory of Parquet files has no way to distinguish between files that are part of a completed write and files that are partially written. It cannot track schema changes over time, and it cannot read a consistent snapshot of the data if another process is writing simultaneously.
Iceberg solves these problems through a layered metadata architecture that tracks every file, every schema version, and every change to the table as an immutable snapshot.
Iceberg Architecture
Iceberg's architecture consists of three layers: the catalog, the metadata layer, and the data layer.
The Catalog
The Iceberg catalog is the entry point for table discovery. It maps table names to the location of the table's current metadata pointer. When a query engine opens an Iceberg table, it first consults the catalog to find the path to the current metadata file.
Iceberg supports multiple catalog implementations including the Hive Metastore, AWS Glue, a REST catalog, and JDBC-based catalogs. The catalog's role is intentionally minimal -- it stores only a pointer to the current metadata file, not the metadata itself. This design allows the metadata layer to be fully self-contained in the storage system.
The Metadata Layer
The metadata layer is the core of Iceberg's design. It consists of three types of files stored alongside the data:
Metadata files contain the table's schema, partition spec, sort order, properties, and a reference to the current snapshot. Each write operation creates a new metadata file. The catalog pointer is then updated atomically to reference the new metadata file.
Manifest lists (also called snapshot manifests) enumerate all the manifest files that make up a particular snapshot of the table. Each snapshot has exactly one manifest list. The manifest list tracks which manifests were added, which were removed, and summary statistics like row counts and partition boundaries.
Manifest files contain the actual file-level metadata: the path to each data file, the file format, the partition values, column-level statistics (min/max values, null counts, value counts), and the file size. Manifest files are the key to Iceberg's query planning performance -- the engine can prune entire files from a scan based on the statistics without opening the files themselves.
This three-level hierarchy -- metadata file, manifest list, manifest file -- enables efficient planning even for tables with millions of data files. The engine can skip entire manifests based on partition boundaries and skip individual files based on column statistics.
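The traversal described above can be sketched as a walk over nested structures. This is an illustrative sketch only, not the Iceberg API: real table metadata is JSON and real manifest lists and manifests are Avro files, and the field names here are simplified placeholders.

```python
# Illustrative shapes only -- real Iceberg metadata is JSON (table metadata)
# and Avro (manifest lists, manifests); field names are simplified.
metadata_file = {
    "schema": ["order_id", "amount", "event_date"],
    "current_snapshot": {"manifest_list": "snap-100.avro"},
}

manifest_lists = {
    "snap-100.avro": ["manifest-a.avro", "manifest-b.avro"],
}

manifests = {
    "manifest-a.avro": [{"path": "data/f1.parquet"}, {"path": "data/f2.parquet"}],
    "manifest-b.avro": [{"path": "data/f3.parquet"}],
}

def plan_scan(metadata):
    """Walk metadata -> manifest list -> manifests to collect data files."""
    manifest_list = metadata["current_snapshot"]["manifest_list"]
    files = []
    for manifest in manifest_lists[manifest_list]:
        files.extend(entry["path"] for entry in manifests[manifest])
    return files

print(plan_scan(metadata_file))  # three data files found, no directory listing
```

The point of the sketch: the planner touches only small metadata files, so no cloud storage directory listing ever happens.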
The Data Layer
The data layer consists of the actual data files stored in formats like Apache Parquet, Apache ORC, or Apache Avro. Iceberg does not prescribe a single file format -- it works with any format that supports its required features (schema projection, predicate pushdown).
In practice, Parquet is the most common format used with Iceberg because of its efficient columnar compression and broad engine support.
Key Features
Schema Evolution
Iceberg supports full schema evolution without rewriting data files. You can add columns, drop columns, rename columns, reorder columns, and widen types (e.g., int to long) -- all as metadata-only operations. The existing data files are not modified.
This is possible because Iceberg assigns a unique ID to each column and tracks the mapping between column IDs and names across schema versions. When a query reads data files written with an older schema, Iceberg uses the column ID mapping to resolve the correct columns, filling in null for columns that did not exist when the file was written.
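The ID-based resolution above can be sketched in a few lines. This is a simplified illustration (not Iceberg's internal representation): files record the column IDs they were written with, and the reader resolves the table's current schema against each file by ID rather than by name or position.

```python
# Sketch of column-ID-based schema resolution (simplified shapes).
current_schema = [
    {"id": 1, "name": "order_id"},
    {"id": 2, "name": "total"},     # renamed from "amount" -- same ID
    {"id": 3, "name": "discount"},  # added after the file was written
]

def read_row(file_column_ids, file_row, schema):
    """Resolve a row from an older file against the current schema."""
    by_id = dict(zip(file_column_ids, file_row))  # column ID -> value
    # Columns that did not exist when the file was written resolve to None.
    return {col["name"]: by_id.get(col["id"]) for col in schema}

# A file written before the rename and before "discount" existed:
old_file_column_ids = [1, 2]
row = read_row(old_file_column_ids, [42, 19.99], current_schema)
print(row)  # {'order_id': 42, 'total': 19.99, 'discount': None}
```

Because the rename changed only the name attached to ID 2, the old file's values still map to the right column.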
Hidden Partitioning
Traditional Hive-style partitioning requires users to know the partition scheme and reference the partition columns in every query. If data is partitioned by a derived year column, the user must filter on that column directly (e.g., WHERE year = 2026) to benefit from partition pruning. If they instead write WHERE event_date > '2026-01-01', the Hive table performs a full scan.
Iceberg decouples the logical query from the physical partitioning. Partition transforms (year, month, day, hour, truncate, bucket) are defined at the table level and applied automatically. A query with WHERE event_date > '2026-01-01' automatically benefits from partition pruning because Iceberg's planner evaluates the filter against the partition metadata. Users do not need to know or reference the partition scheme.
Iceberg also supports partition evolution -- changing the partition scheme without rewriting existing data. New data is written with the new partition spec, and queries that span both old and new partitions work correctly because the manifest files track which partition spec applies to each data file.
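The transform mechanics can be sketched as follows. This is a conceptual illustration, assuming a simplified day transform and a toy partition index; Iceberg's real transforms operate on typed values and produce partition tuples tracked in manifests.

```python
# Sketch of a 'day' partition transform and automatic pruning (simplified).
import datetime

def day_transform(d: datetime.date) -> int:
    """Iceberg-style 'day' transform: days since the Unix epoch."""
    return (d - datetime.date(1970, 1, 1)).days

# Toy partition index: transformed value -> data files in that partition.
partitions = {
    day_transform(datetime.date(2025, 12, 31)): ["data/f1.parquet"],
    day_transform(datetime.date(2026, 1, 2)): ["data/f2.parquet"],
}

# The user filter "event_date > '2026-01-01'" is translated into a bound on
# the transformed value -- the user never references a partition column.
lower_bound = day_transform(datetime.date(2026, 1, 1))
matching = [files for value, files in partitions.items() if value > lower_bound]
print(matching)  # only the 2026-01-02 partition survives pruning
```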
Time Travel
Every write to an Iceberg table creates a new snapshot. Snapshots are immutable and retained according to the table's expiration policy. This enables time travel queries -- reading the table as it existed at a specific point in time or a specific snapshot ID.
Time travel is useful for reproducible analytics, debugging data pipeline issues, and auditing. An analyst can compare the current state of a table with its state from last week to understand what changed. A data engineer can roll back a table to a previous snapshot if a bad write corrupts the data.
-- Query the table as of a specific timestamp
SELECT * FROM orders TIMESTAMP AS OF '2026-03-01 00:00:00';
-- Query a specific snapshot
SELECT * FROM orders VERSION AS OF 12345678;
ACID Transactions
Iceberg provides serializable isolation for write operations through optimistic concurrency control. Each write operation reads the current metadata, computes the new metadata, and then attempts to atomically swap the catalog pointer from the old metadata file to the new one. If another writer has modified the table in the meantime, the operation detects the conflict and retries.
This approach enables concurrent readers and writers without locking. Readers always see a consistent snapshot -- they follow the metadata pointer that was current when they started their scan. Writers produce new snapshots that become visible atomically when the catalog pointer is updated.
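The writer side of this protocol can be sketched as a read-compute-retry loop. This is a conceptual sketch, not Iceberg's API: the table's version number stands in for the catalog's metadata pointer, and the swap is the atomic operation the catalog provides.

```python
# Sketch of an optimistic-concurrency commit loop (simplified).
import itertools

class Table:
    def __init__(self):
        self.version = 0  # stands in for the catalog's metadata pointer

    def try_commit(self, expected_version):
        """Atomic swap: succeeds only if no one committed since we read."""
        if self.version != expected_version:
            return False
        self.version += 1
        return True

def write_with_retry(table, max_attempts=5):
    """Read current metadata, compute a new snapshot, retry on conflict."""
    for attempt in itertools.count(1):
        seen = table.version        # read the current metadata version
        # ... compute the new snapshot based on `seen` ...
        if table.try_commit(seen):  # attempt the atomic pointer swap
            return attempt
        if attempt >= max_attempts:
            raise RuntimeError("too many commit conflicts")

t = Table()
print(write_with_retry(t))  # 1 -- no conflict, first attempt succeeds
```

Readers never enter this loop at all: they pin the metadata version they started with and scan a consistent snapshot.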
File-Level Statistics and Pruning
Each manifest file in Iceberg records column-level statistics for every data file: minimum values, maximum values, null counts, and value counts. The query planner uses these statistics to skip files that cannot contain matching rows.
For a query like WHERE amount > 1000, the planner checks the max(amount) statistic for each file. If a file's maximum amount is 500, the file is pruned from the scan. This file-level pruning can eliminate the majority of I/O for selective queries on large tables, without requiring the files to be partitioned on the filter column.
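The pruning check is simple to sketch. The entries below mirror (in simplified form) the per-file min/max statistics a manifest records; the field names are illustrative.

```python
# Sketch of file-level pruning from manifest statistics (simplified).
files = [
    {"path": "data/f1.parquet", "min_amount": 10,   "max_amount": 500},
    {"path": "data/f2.parquet", "min_amount": 900,  "max_amount": 4000},
    {"path": "data/f3.parquet", "min_amount": 1200, "max_amount": 8000},
]

def prune(files, threshold):
    """Keep only files whose max could satisfy `amount > threshold`."""
    return [f["path"] for f in files if f["max_amount"] > threshold]

print(prune(files, 1000))  # f1 is skipped without ever being opened
```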
Iceberg vs. Other Table Formats
Iceberg vs. Hive Tables
Hive tables were the original table format for data lakes. A Hive table is essentially a directory of files registered in the Hive Metastore, with partitions mapped to subdirectories (e.g., year=2026/month=03/).
Hive's limitations motivated Iceberg's creation:
- Partition discovery is directory-based. The metastore tracks partitions, but file-level metadata is not maintained. Listing all files in a partition requires expensive cloud storage listing operations.
- No file-level statistics. Hive cannot prune individual files within a partition. All files in a matching partition must be scanned.
- No schema evolution safety. Schema changes in Hive can break existing data files because there is no column ID mapping.
- No snapshot isolation. Concurrent reads and writes can produce inconsistent results because there is no atomic metadata update mechanism.
- Partition changes require data rewriting. Changing the partition scheme of a Hive table requires rewriting all the data.
Iceberg addresses every one of these limitations. It tracks individual files with column-level statistics, supports safe schema evolution through column IDs, provides snapshot isolation through immutable metadata, and allows partition evolution without data rewrites.
Iceberg vs. Delta Lake
Delta Lake is another open table format, originally developed at Databricks. Both Iceberg and Delta Lake solve the same core problems -- ACID transactions, schema evolution, and time travel on data lakes -- but they differ in architecture and ecosystem alignment.
Delta Lake uses a transaction log (the _delta_log directory) consisting of JSON commit files and Parquet checkpoint files. Iceberg uses its three-level metadata hierarchy (metadata files, manifest lists, manifest files). Delta Lake's log-based approach requires periodic checkpointing to keep log replay fast on large tables, while Iceberg's manifest-based approach provides consistent planning performance regardless of write history.
Iceberg has broader engine support -- it is natively integrated with Spark, Trino, Flink, Dremio, Snowflake, and others. Delta Lake has the deepest integration with Databricks and Spark, with growing support in other engines through the Delta Kernel project.
Both formats are actively developed and production-ready. The choice often depends on the primary query engine and cloud platform in use.
Iceberg vs. Apache Hudi
Apache Hudi (Hadoop Upserts Deletes and Incrementals) was designed primarily for streaming ingestion with upsert and incremental processing capabilities. Hudi provides two storage types: Copy-on-Write (CoW), which rewrites files on each update, and Merge-on-Read (MoR), which writes deltas and merges them at read time.
Iceberg and Hudi overlap in providing ACID transactions and time travel, but they emphasize different workloads. Hudi excels at high-frequency streaming upserts, particularly on Spark. Iceberg's design prioritizes read performance, engine-agnostic compatibility, and metadata scalability for tables with millions of files. Iceberg's position-delete and equality-delete mechanisms handle updates without the tight coupling to a specific write engine that Hudi's architecture requires.
How Spice Uses Apache Iceberg
Spice connects to Iceberg tables as a federated data source through its SQL federation layer. Users can query Iceberg tables stored in Amazon S3, Google Cloud Storage, or Azure Blob Storage through Spice's unified SQL interface -- alongside data from PostgreSQL, MySQL, Databricks, Snowflake, and 30+ other sources.
When Spice queries an Iceberg table, it reads the Iceberg metadata to identify the relevant data files, applies partition pruning and file-level statistics pruning based on the query's predicates, and uses predicate pushdown to minimize the amount of data read from object storage. This means a query with a WHERE clause on a partitioned or statistics-rich column reads only the files that could contain matching rows.
For workloads that require sub-second query performance, Spice can accelerate Iceberg tables by caching them locally. The acceleration layer stores a local copy of the data that can be queried without round-trips to object storage. Combined with change data capture, the local cache stays synchronized with the source Iceberg table.
This approach lets development teams use Iceberg as their primary data lake table format while querying it through Spice alongside operational databases and other analytical stores -- all through a single SQL interface powered by Apache DataFusion.
-- Query an Iceberg table in S3 alongside a PostgreSQL table through Spice
SELECT o.order_id, o.amount, c.name
FROM iceberg.orders o
JOIN postgres.customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01'
ORDER BY o.amount DESC
LIMIT 100;
Advanced Topics
Metadata Scalability and Planning Performance
One of Iceberg's key design goals is planning performance at scale. Tables in production data lakes can contain millions of data files across thousands of partitions. Listing files through directory-based approaches (as Hive does) becomes prohibitively slow at this scale -- cloud storage listing operations have high latency and limited throughput.
Iceberg's manifest-based approach eliminates directory listing entirely. The manifest list for a snapshot enumerates all manifest files, and each manifest file enumerates its data files with full statistics. The planner reads these metadata files (which are typically small and can be cached) and builds the file scan plan without any directory listing. This makes planning time proportional to the number of manifest files, not the number of data files.
For extremely large tables, Iceberg supports manifest pruning based on partition bounds stored in the manifest list. If a query filters on a partitioned column, the planner can skip entire manifests that contain no matching partitions, further reducing planning overhead.
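Manifest-level pruning can be sketched the same way as file-level pruning, one level up. This is a simplified illustration: the manifest list entries below carry toy partition bounds as date strings, whereas real entries store typed partition field summaries.

```python
# Sketch of manifest pruning via partition bounds in the manifest list.
manifest_list = [
    {"manifest": "m1.avro", "min_day": "2025-01-01", "max_day": "2025-06-30"},
    {"manifest": "m2.avro", "min_day": "2025-07-01", "max_day": "2025-12-31"},
    {"manifest": "m3.avro", "min_day": "2026-01-01", "max_day": "2026-06-30"},
]

def prune_manifests(entries, lower_bound):
    """Skip manifests whose partition range ends before the filter bound."""
    return [e["manifest"] for e in entries if e["max_day"] > lower_bound]

# Filter: event_date > '2026-01-01' -- only one manifest needs to be read.
print(prune_manifests(manifest_list, "2026-01-01"))  # ['m3.avro']
```

Two manifests, and every data file they enumerate, are eliminated before a single manifest is opened.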
Row-Level Operations: Deletes and Updates
Iceberg supports row-level deletes and updates through two mechanisms: position deletes and equality deletes.
Position deletes specify the file path and row position of each deleted row. They are efficient for targeted deletes where the engine already knows which rows to remove (e.g., after a join-based deduplication). Position deletes are fast to apply because the reader can skip specific row positions during file scans.
Equality deletes specify the column values that identify deleted rows. They are useful when the exact file and row position are not known, such as when deleting all rows where user_id = 123. Equality deletes are more flexible but slower to apply because the reader must evaluate the delete predicate against every row.
In Iceberg v2, both delete mechanisms produce delete files that are tracked in the manifest alongside data files. Over time, accumulating delete files degrades read performance. Table maintenance operations (compaction) merge delete files with data files to produce new, clean data files.
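The read-time merge of both delete types can be sketched as follows. This is a conceptual illustration with simplified shapes, not Iceberg's reader implementation: position deletes skip rows by position, while equality deletes require evaluating a predicate per row.

```python
# Sketch of applying v2 delete files at read time (simplified shapes).
data_rows = [
    {"pos": 0, "user_id": 7,   "amount": 10},
    {"pos": 1, "user_id": 123, "amount": 20},
    {"pos": 2, "user_id": 9,   "amount": 30},
]

# Position delete: row positions within a data file (file path elided here).
position_deletes = {0}

# Equality delete: column values identifying deleted rows.
equality_deletes = [{"user_id": 123}]

def apply_deletes(rows, positions, equalities):
    """Drop rows hit by either a position delete or an equality delete."""
    out = []
    for row in rows:
        if row["pos"] in positions:
            continue  # cheap: skip by row position
        if any(all(row[k] == v for k, v in eq.items()) for eq in equalities):
            continue  # slower: predicate evaluated against every row
        out.append(row)
    return out

print(apply_deletes(data_rows, position_deletes, equality_deletes))
# only the row at position 2 survives
```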
Table Maintenance: Compaction and Expiration
Iceberg tables require periodic maintenance to ensure optimal read performance:
Compaction rewrites small data files into larger ones and merges delete files with data files. Without compaction, a table that receives many small writes accumulates thousands of small files, each adding overhead to the scan plan. Compaction reduces file count, eliminates delete files, and improves compression efficiency.
Snapshot expiration removes old snapshots and their associated metadata and data files. While time travel requires retaining historical snapshots, indefinite retention increases storage costs and metadata size. Expiration policies define how long snapshots are retained (e.g., 7 days) before they are eligible for garbage collection.
Orphan file cleanup removes data files that are not referenced by any snapshot. Orphan files can result from failed write operations that produced data files but did not successfully commit the metadata update. Periodic cleanup prevents these files from accumulating and consuming storage.
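The expiration logic above can be sketched as set arithmetic over snapshots. This is a simplified illustration (real expiration is a table maintenance procedure that also rewrites metadata): a data file becomes garbage only when no retained snapshot references it.

```python
# Sketch of snapshot expiration and file garbage collection (simplified).
RETENTION_DAYS = 7

snapshots = [
    {"id": 1, "age_days": 30, "files": {"f1", "f2"}},
    {"id": 2, "age_days": 10, "files": {"f2", "f3"}},
    {"id": 3, "age_days": 1,  "files": {"f3", "f4"}},
]

def expire(snaps, retention_days):
    """Return (kept snapshots, data files safe to garbage-collect)."""
    kept = [s for s in snaps if s["age_days"] <= retention_days]
    live = set().union(*(s["files"] for s in kept)) if kept else set()
    all_files = set().union(*(s["files"] for s in snaps))
    return kept, all_files - live

kept, removable = expire(snapshots, RETENTION_DAYS)
print([s["id"] for s in kept], sorted(removable))  # [3] ['f1', 'f2']
```

Note that f3 survives even though snapshot 2 expired, because the retained snapshot 3 still references it.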
Iceberg REST Catalog Protocol
The Iceberg REST catalog protocol defines a standard HTTP API for catalog operations: listing namespaces and tables, loading table metadata, and committing metadata updates. The REST catalog is becoming the preferred catalog implementation because it decouples the catalog from a specific backend (Hive Metastore, AWS Glue) and allows catalog providers to implement custom authorization, caching, and multi-tenancy logic behind the standard API.
The protocol uses optimistic concurrency for commits -- the client sends the current metadata location along with the new metadata, and the server rejects the commit if the current metadata has changed since the client read it. This enables safe concurrent writes through any REST-compatible catalog without requiring distributed locking.
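The server-side conflict check can be sketched as below. This is a conceptual sketch, not the REST catalog wire format: the real protocol exchanges full metadata documents and structured commit requirements rather than bare location strings, though a conflicting commit is indeed rejected with an HTTP 409.

```python
# Sketch of the conflict check behind a REST catalog commit (simplified).
class RestCatalogServer:
    def __init__(self):
        self.current_metadata = {}  # table -> current metadata location

    def commit(self, table, expected_location, new_location):
        """Reject the commit if the metadata moved since the client read it."""
        if self.current_metadata.get(table) != expected_location:
            return {"status": 409, "error": "commit conflict"}
        self.current_metadata[table] = new_location
        return {"status": 200, "metadata-location": new_location}

server = RestCatalogServer()
server.commit("db.orders", None, "v1.json")
stale = server.commit("db.orders", None, "v2.json")  # client read stale state
print(stale["status"])  # 409 -- the client must refresh and retry
```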
Apache Iceberg FAQ
What is the difference between Apache Iceberg and Delta Lake?
Both are open table formats that provide ACID transactions, schema evolution, and time travel on data lakes. Iceberg uses a three-level metadata hierarchy (metadata files, manifest lists, manifest files) while Delta Lake uses a JSON/Parquet transaction log. Iceberg has broader native engine support across Spark, Trino, Flink, and Snowflake. Delta Lake has the deepest integration with Databricks and Spark. The choice often depends on your primary query engine and ecosystem.
What is hidden partitioning in Apache Iceberg?
Hidden partitioning decouples the physical partition scheme from user queries. Partition transforms (year, month, day, hour, truncate, bucket) are defined at the table level and applied automatically during query planning. Users write standard filter expressions like WHERE event_date > '2026-01-01', and Iceberg automatically prunes partitions without the user needing to know or reference the partition columns. This eliminates a common source of user errors and full-table scans in Hive-partitioned tables.
How does time travel work in Apache Iceberg?
Every write to an Iceberg table creates an immutable snapshot. Iceberg retains historical snapshots according to a configurable expiration policy. Time travel queries reference a specific timestamp or snapshot ID to read the table as it existed at that point. This enables reproducible analytics, data pipeline debugging, and rollback to a known good state if a write introduces incorrect data.
Can I query Apache Iceberg tables with Spice?
Yes. Spice connects to Iceberg tables as a federated data source. You can query Iceberg tables in S3, GCS, or Azure Blob Storage through Spice's unified SQL interface alongside data from PostgreSQL, MySQL, Databricks, and 30+ other sources. Spice applies partition pruning and predicate pushdown when reading Iceberg metadata, and supports optional local acceleration for sub-second query performance.
Does Apache Iceberg support schema evolution?
Yes. Iceberg supports adding, dropping, renaming, and reordering columns, as well as widening types, all as metadata-only operations. Existing data files are not rewritten. Iceberg assigns a unique ID to each column and uses these IDs to resolve schemas across file versions, so queries against files written with older schemas return correct results with null values for columns that did not exist at write time.
Learn more about Iceberg and Spice
Technical resources on how Spice connects to Apache Iceberg tables for federated SQL queries.
Spice.ai OSS Documentation
Learn how Spice connects to Apache Iceberg tables for federated SQL queries with predicate pushdown and local acceleration.

How we use Apache DataFusion at Spice AI
A technical overview of how Spice extends Apache DataFusion with custom table providers, optimizer rules, and UDFs.

See Spice in action
Walk through your use case with an engineer and see how Spice handles federation, acceleration, and AI integration for production workloads.
Talk to an engineer