What is a Data Lakehouse?

The Data Warehouse and the Data Lake both have their strengths and weaknesses. Like yin and yang, they often coexist within the same data stack and this has given rise to a hybrid category — the Data Lakehouse!

Pedram Navid

April 13, 2021

8 minutes

Transaction support (ACID compliance)

Schema enforcement and governance

Business intelligence (BI) support

And the following are properties of a data lakehouse that separate it from a data warehouse:

Decoupling of storage and computing
Support for data types, ranging from structured to unstructured
Support for all kinds of workloads. Data science, machine learning (ML), and analytics tools can all use the same repository
End-to-end streaming

Let's pause and take a closer look at that last set of capabilities.

Decoupled Storage and Computing

If decoupling storage from computing gives you déjà vu, it's not without good reason. The lines between a data lake and a data warehouse have been blurring for years—we just didn't call it a data lakehouse. Not only have modern enterprise data warehouses (EDWs) become faster and cloud-based, they have been slowly creeping toward the data lake in terms of new features and cheap storage.

Let's look at two popular solutions offering cheap storage that scales independently from computing power.

Google BigQuery has been around for over a decade. Look under the hood of Google's cloud EDW solution, and you'll find a strict separation of Dremel, the query engine, and Colossus, where all data is stored in the ColumnIO file format, supporting fast access for Dremel workloads.

From a Google Cloud blog article from 2016: "Colossus allows BigQuery users to scale to dozens of petabytes in storage seamlessly, without paying the penalty of attaching much more expensive compute resources—typical with most traditional databases."

With Snowflake's successful IPO in 2020, it's easy to forget that their EDW solution is almost a decade old. Here too is the separation of storage and computing at the core of the product.

And that has never been different. From a 2014 Snowflake whitepaper: "In a traditional data warehouse, storage, compute, and database services are tightly coupled…Snowflake's separation of storage, compute, and system services makes it possible to dynamically modify the configuration of the system. Resources can be sized and scaled independently and transparently, on-the-fly."

These two examples prove that the decoupling of storage and computing isn’t really that much of a differentiating capability of the data lakehouse.

Semi-structured and Unstructured Data

That brings us to support for semi-structured and unstructured data. On top of JSON support, BigQuery has its nested field type. Snowflake handles semi-structured formats such as JSON, XML, Parquet, and Avro as a VARIANT field.

While it’s correct that genuine support for unstructured data—such as images, video, or sound—is not supported, many businesses don't have a need for it, either because it's unrelated to their market or they don’t have aspirations to develop AI applications like computer vision in-house.

Moreover, the possibility of storing raw and semi-structured data in an EDW is responsible for flipping the typical workflow upside-down. If storage is cheap, and structure is not a prerequisite, there’s no reason for manipulating the data before storing it in a data warehouse. It even spawned a new set of popular tools and an accompanying job role tailored for ELT workflows within modern EDWs like Snowflake and BigQuery. dbt (data build tool) and Dataform are the weapons of choice of the analytics engineer.

All Kinds of Workloads

With analytical query engines at their core, BigQuery and Snowflake are best-of-breed solutions for analytical workloads. Nevertheless, both vendors encourage their users to utilize their data repositories for all kinds of workloads, including machine learning.

Even more so, BigQuery ML abates the nuisances that come with the machine-learning workflow by allowing end users to engineer features and utilize a spectrum of unsupervised and supervised machine-learning methods within BigQuery SQL.

End-to-end Streaming

And then finally, streaming. While Snowflake does not support a native solution for receiving streaming data, it has an easy-to-set-up middleware known as Snowpipe that listens for changes in a storage bucket and sends the data into the warehouse.

BigQuery, on the other hand, supports streaming inserts natively via its API.

So…what exactly is a Data Lakehouse?

It's easy to see what sets the lakehouse apart from the traditional data lake. But from the modern data warehouse? Not so much.

As I've shown, data warehouses have long been crossing over to their data lake counterparts and adopting a variety of their capabilities. Google even lists “convergence of data warehouse and data lake” on BigQuery’s product page. Consequently, from a functional perspective, the lakehouse isn't fundamentally different from a modern data warehouse.

The main differentiator is of a technical nature: the data lakehouse adheres to the data lake paradigm.

At its core, there is a storage layer such as HDFS or cloud storage.
On top of that, there is a query engine such as Apache Spark, Apache Drill, Amazon Athena, or Presto.
Finally, the third layer is a storage abstraction framework, such as Delta Lake or Hudi,` that adds additional capabilities like ACID compliance and schema enforcement.

Consequently, it’s the direction of the crossover that defines the data lakehouse.

Finally, the data type support remains much broader than nested and tabular objects, including audio, video, and time series. If those formats are your concern, the data lake(house) continues to be the most flexible destination for simply dumping your data, both stream and batch.

Conclusion

Is your organization assessing whether it should embrace a data lakehouse or a data warehouse? Or does it want to simplify its stack by eliminating the lake or the warehouse?

As I've shown, both modern warehouses and lakehouses share a lot of the same capabilities. If you don't care about what's under the hood or which paradigm should be abided by, the degree of unstructuredness that you require should be your primary consideration.

What is a Data Lakehouse?

Decoupled Storage and Computing

Semi-structured and Unstructured Data

All Kinds of Workloads

End-to-end Streaming

So…what exactly is a Data Lakehouse?

Conclusion

More on the blog

Activate your data in less than 5 minutes

What is a Data Lakehouse?self.__wrap_n!=1&&self.__wrap_b(":R136cdaiukm:",1)

Decoupled Storage and Computing

Semi-structured and Unstructured Data

All Kinds of Workloads

End-to-end Streaming

So…what exactly is a Data Lakehouse?

Conclusion

More on the blog

Activate your data in less than 5 minutesself.__wrap_n!=1&&self.__wrap_b(":R4pktaiukm:",1)

What is a Data Lakehouse?

Activate your data in less than 5 minutes