The “Data Lakehouse” architecture combines the benefits of data warehouses with the flexibility and scalability of cloud data lakes. But what is a “Lakehouse” exactly, other than a chameleon-like marketing term for vendors’ offerings (1, 2, 3, 4, 5, 6, 7)?
To better understand why there is excitement behind Lakehouses, this blog will try to set aside marketing and focus specifically on open table formats like Apache Iceberg and Delta Lake and how they enable these new Lakehouse architectures on top of a data lake.
Data lakes were built as a repository for semi-structured data, with an initial focus on scalability and cost. First-generation data lakes were based on Hadoop's HDFS but are now centered around cloud object stores. In the 2010s, Apache Hive and its associated table format became the dominant standard for organizing and querying data on data lakes. The Hive table format suffers from performance limitations at scale and falls short of the schema and SQL governance available in data warehouses.
What is a “Table Format”? What is a “Lakehouse”?
Databricks originated the “Lakehouse” term and recently described the contribution of the Delta Lake table format as:
Delta Lake enables warehouse-style features such as ACID transactions, time travel, audit logging, and fast metadata operations over tabular datasets.
The two most popular open table formats, Apache Iceberg and Delta Lake, both enable the above-mentioned data warehouse semantics on top of data lakes.
Table formats define how a set of metadata and data files represent a table’s state at various points in time. Importantly, this means table formats are NOT engines or services, but rather a convention for writing data and metadata files such that a database-like commit log can be reconstructed.
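To make the "reconstructed commit log" idea concrete, here is a minimal sketch (not either format's actual spec; the action fields and file names are made up) of replaying per-commit add/remove actions to recover a table's current set of data files:

```python
# Sketch: a table format's metadata is a log of commits, each recording
# which data files were added or removed. Replaying the log reconstructs
# the table's state at any commit. Field names here are illustrative.

def replay(commits):
    """Replay a list of commits; each lists data files added/removed."""
    live_files = set()
    for commit in commits:
        live_files |= set(commit.get("add", []))
        live_files -= set(commit.get("remove", []))
    return live_files

commits = [
    {"add": ["data/f1.parquet", "data/f2.parquet"]},              # commit 0
    {"add": ["data/f3.parquet"], "remove": ["data/f1.parquet"]},  # commit 1
]
print(sorted(replay(commits)))
# ['data/f2.parquet', 'data/f3.parquet']
```

Replaying a prefix of the log yields an older state of the table, which is the mechanism behind the time-travel feature discussed later.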
A table format is a specific piece of technology, whereas the Lakehouse architecture is a broad marketing term for an architecture that combines the benefits of data lakes and data warehouses. The table format is the key enabling technology that brings data warehouse advantages to the data lake.
The Data Lakehouse architecture can be viewed as a cloud data lake with the addition of a modern table format like Apache Iceberg or Delta Lake, or a proprietary table format as in Snowflake’s case. The rest of this blog will focus on the Iceberg and Delta table formats and what properties they have in common. Consult other Iceberg and Delta blogs to learn more about the differences.
Table Format Example
The folder structure below illustrates an Iceberg table that consists of several data files from two different commits. These files could be on a filesystem like HDFS or an object store like AWS S3.
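A hypothetical Iceberg-style layout looks like this (file names are illustrative, not exact):

```
orders/
├── data/
│   ├── 00000-0-a1b2.parquet        # data files from commit 1
│   ├── 00001-0-c3d4.parquet
│   └── 00002-0-e5f6.parquet        # data file from commit 2
└── metadata/
    ├── v1.metadata.json            # table metadata after commit 1
    ├── v2.metadata.json            # table metadata after commit 2
    ├── snap-100-manifest-list.avro # snapshot for commit 1
    ├── snap-200-manifest-list.avro # snapshot for commit 2
    └── manifest-a1b2.avro          # per-file stats and add/remove entries
```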
The above example is an Iceberg table, but Delta tables have a similar structure, with JSON-encoded metadata stored in a “_delta_log/” subdirectory.
Importantly, in this table there are two different snapshots, or views of the data. In order for a client to correctly interpret the table, the metadata in the Avro files indicates schema information and which data files were added or removed at each commit. The contents of these metadata files are crucial for enabling the benefits discussed later.
The table format encompasses everything in the above folder structure, with the exception of atomic table updates. These table formats were designed for object stores, like AWS S3, which lack the atomic operations needed to safely commit updates from multiple concurrent writers. In that case, the table format relies on an external service to handle transactional locks: the Hive Metastore (among others) for Iceberg, or DynamoDB for Delta Lake. Transactional updates work as-is on filesystems like HDFS that provide an atomic rename operation.
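One common scheme for that atomic step is optimistic concurrency: a writer prepares new metadata, then swaps a pointer to it only if no other writer committed first. The sketch below (a simplification, not either project's implementation; a thread lock stands in for the rename or external catalog service) shows the compare-and-swap shape of a commit:

```python
import threading

# Sketch of an optimistic, compare-and-swap style commit. Real tables
# delegate this atomic step to an atomic rename (HDFS) or an external
# locking service (e.g. Hive Metastore, DynamoDB); a lock stands in here.

class Catalog:
    """Holds the pointer to the table's current metadata version."""
    def __init__(self):
        self.current_version = 0
        self._lock = threading.Lock()

    def commit(self, expected_version, new_version):
        """Swap the pointer only if nobody else committed in between."""
        with self._lock:
            if self.current_version != expected_version:
                return False          # conflict: caller must retry
            self.current_version = new_version
            return True

catalog = Catalog()
print(catalog.commit(0, 1))   # True: first writer wins
print(catalog.commit(0, 2))   # False: stale writer must retry against v1
```

A writer whose commit fails re-reads the latest metadata, reapplies its changes, and retries, which is why a losing writer never corrupts the table.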
Why Table Formats Improve Data Lakes
The advantages of table formats fall into three broad categories.
Improved Data Warehouse Semantics
- Transactional updates to data
- Support table evolution: schema changes, adding, removing, and renaming columns, and changing partitioning
- Support time-travel queries (i.e., snapshotting table state)
Performance at Scale
- Support rewriting small files more efficiently
- Query planning performance improvements by file skipping and removing expensive listing operations
Ease of Adoption
- Requires no new hardware or software: table formats run on existing storage systems (HDFS, S3) alongside existing data, and most engines (Spark, Flink, Trino, etc.) are already compatible
Advantages in More Detail
Transactional Table Updates
Table formats ensure that table updates are ACID transactional, i.e. “all or nothing.”
Updates to a table can come in the form of periodic large batches, deletion of older data, or small single-row additions. Additionally, these updates can occur across partitions, resulting in many individual files being added to the underlying table.
But querying a table while an update is in progress should never expose a subset of the new data, as this can easily yield incorrect results. For example, data may suddenly appear skewed because some partitions have been updated with new data but others have not. With transactional table format updates, changes to the table are atomic and queries never see inconsistent views of the data.
Schema Evolution
Table formats enable schemas to change gracefully over time without requiring rewrites.
Datasets naturally evolve over time in many ways: new columns are introduced, columns are removed or renamed, and field data types change. In a data lake, these kinds of schema changes are brittle and often require rewriting the entire table.
To illustrate, consider a table with a column that is renamed several times over the years and has had its type widened from float to double. Older data files contain a mix of the different column names and data types. Table formats track each change and can correctly present data written in older schemas without needing to rewrite files.
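Iceberg, for instance, accomplishes this by resolving columns through a stable field ID rather than by name. The sketch below (simplified, with made-up IDs and file structures) shows how a rename plus a float-to-double widening can be resolved at read time without touching old files:

```python
# Sketch: resolve columns by a stable field ID, not by name, so renames
# and safe type widening never require rewriting old data files.

CURRENT_SCHEMA = {1: ("total_amount", "double")}  # field id -> (name, type)

# An old data file written when field 1 was still named "amount" (float).
old_file = {"schema": {1: ("amount", "float")},
            "rows": [{1: 1.5}, {1: 2.25}]}

def read_with_current_schema(data_file):
    """Present old rows under the current column names and types."""
    out = []
    for row in data_file["rows"]:
        projected = {}
        for field_id, value in row.items():
            name, type_ = CURRENT_SCHEMA[field_id]
            projected[name] = float(value) if type_ == "double" else value
        out.append(projected)
    return out

print(read_with_current_schema(old_file))
# [{'total_amount': 1.5}, {'total_amount': 2.25}]
```

Because the field ID never changes, every historical file maps unambiguously onto the current schema, however many renames have happened in between.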
Time Travel Queries
Time travel provides the ability to query a table as it was at a specific and repeatable point in time, even if the table has changed since that time.
As an example, when an ML model trained a month ago starts performing poorly in production, it may require debugging to understand what went wrong. On a traditional data lake, there is no way to exactly reproduce the same view of the data, only to guess at a reasonable time-based filter. With time-travel, it is possible to repeat a query against exactly the same dataset as the original query.
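Conceptually, time travel is just selecting an older snapshot from the commit log. The sketch below (snapshot and timestamp values are illustrative) picks the latest snapshot at or before a requested time:

```python
import datetime

# Sketch: each commit records a snapshot (a timestamp plus the set of
# live data files). Time travel = querying an older snapshot's files.

snapshots = [
    {"ts": datetime.datetime(2024, 1, 1), "files": {"f1.parquet"}},
    {"ts": datetime.datetime(2024, 2, 1), "files": {"f1.parquet", "f2.parquet"}},
    {"ts": datetime.datetime(2024, 3, 1), "files": {"f2.parquet", "f3.parquet"}},
]

def files_as_of(ts):
    """Return the files visible at time `ts` (latest snapshot <= ts)."""
    visible = [s for s in snapshots if s["ts"] <= ts]
    if not visible:
        raise ValueError("table did not exist yet")
    return visible[-1]["files"]

# Reproduce last month's training query against exactly the same data:
print(sorted(files_as_of(datetime.datetime(2024, 2, 15))))
# ['f1.parquet', 'f2.parquet']
```

Because snapshots are immutable, rerunning the query against the same snapshot is repeatable, which is exactly what the ML debugging scenario above needs.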
Small File Compaction
Table formats support periodically rewriting the internal table representation into a more efficient form.
Inevitably, large datasets and tables are physically stored as a collection of many files or objects. But too many small files negatively impact performance for several reasons: metadata overhead, inefficient columnar encodings, lower compression, and slower query planning. High-latency object stores exacerbate these problems.
Yet there is always a tradeoff between ingesting data quickly for near-real time analytics and waiting for larger batches for more efficient ingestion. The ability to periodically compact small files/objects allows data engineers to have it both ways: near-real time ingestion without sacrificing long-term query performance.
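A compaction planner boils down to grouping small files into batches near a target size and rewriting each batch as one file. This is a greedy sketch in that spirit (the target size and file set are hypothetical, and real rewrite actions in Iceberg and Delta Lake are considerably more sophisticated):

```python
# Sketch: greedily pack small files into compaction groups that approach
# a larger target file size; each group would be rewritten as one file.

TARGET_BYTES = 128 * 1024 * 1024  # illustrative 128 MB target

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group files so each group's total size approaches the target."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # Only groups with more than one file are worth rewriting.
    return [g for g in groups if len(g) > 1]

small_files = {"a": 10 << 20, "b": 20 << 20, "c": 100 << 20, "d": 120 << 20}
print(plan_compaction(small_files))
# [['a', 'b']]
```

Because compaction produces a new snapshot through the same transactional commit path as any other write, readers never see a half-compacted table.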
Improved Query Performance
The best way to improve query performance at scale is to do less work. Table formats 1) avoid expensive listing operations and 2) read as few bytes as possible.
With Hive tables, a query needs to issue a LIST to enumerate all the objects (or files) in each partition the query touches and then at minimum read the Parquet footer of each object to determine if the rest of the object needs to be read.
But with table formats, the metadata about the table, including partition contents and even per-file column statistics, is encoded together in the commit log. Listings are no longer needed for discovery, avoiding what is often a metadata-intensive and slow storage operation. Further, many data files can be skipped entirely based on column statistics kept by the table format, which is especially useful when rows are sorted or clustered within partitions.
Now, the determination of which storage bytes must be read for the query can happen by checking a small, bounded number of metadata objects. At petabyte scales, this is dramatically less work.
As an example, consider a query that filters on a partition column and a sort column in a large table. This used to require a LIST and then a GET for each of the thousands of objects in the partition. But with table-format metadata, the query planner can identify the single Parquet object containing rows that match both conditions, and read only that object from storage.
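The pruning step amounts to a range check against each file's min/max statistics. A minimal sketch (file names, column, and stats are made up) of skipping files with the per-file statistics a table format keeps:

```python
# Sketch: prune data files using per-file min/max column statistics kept
# in table-format metadata, instead of LISTing and opening every object.

files = [
    {"path": "f1.parquet", "min_ts": 100, "max_ts": 200},
    {"path": "f2.parquet", "min_ts": 201, "max_ts": 300},
    {"path": "f3.parquet", "min_ts": 301, "max_ts": 400},
]

def files_for_range(lo, hi):
    """Keep only files whose [min, max] range can contain matching rows."""
    return [f["path"] for f in files
            if f["max_ts"] >= lo and f["min_ts"] <= hi]

# A query filtering on ts BETWEEN 250 AND 260 reads exactly one file:
print(files_for_range(250, 260))
# ['f2.parquet']
```

This works best when rows are sorted or clustered within partitions, since tight min/max ranges per file make the pruning far more selective.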
Open table formats like Apache Iceberg and Delta Lake add the schema management and ACID transactions of data warehouses to the scalability, flexibility, and economics of cloud data lakes. The table format is arguably the key technology pillar that separates the existing data lake from the new class of “Data Lakehouse” architectures. With broad support across query engines like Spark, Trino, and Flink and side-by-side compatibility with existing object or HDFS storage, it is now easy to start adopting these open table formats.