Let’s assume that you’re well-off enough to have an entire lake in your possession. What do you do next? Build a lakehouse, of course.
You know, an open architecture for managing your organization’s data that combines the scale of data lakes and the ACID-friendly queries of data warehouses on a single, flexible, and cost-effective platform.
We’re talking about a platform to handle the vast quantities of an organization’s data here, not your second (or third) house where you store your pontoon boat and only visit two weekends every year.
The data lakehouse is a growing market segment, with companies like Dremio, Databricks, and Onehouse already elbowing for the best cloud implementation of open frameworks like Apache Hudi, Apache Iceberg, and Delta Lake. But before jumping straight into the supposed benefits of the lakehouse, let’s talk about how the industry got here, to a new product category, just as it seemed like data lakes were catching on.
Years ago, the data warehouse was the standard for business intelligence and analytics. Organizations stored their structured data in an ACID-compliant environment, meaning one that guarantees the atomicity, consistency, isolation, and durability of transactions on the warehouse’s data. For all the benefits warehouses created in terms of data quality and business analytics, they were costly, and their inflexibility tended to create silos.
The data lake was developed as an answer to these problems. As a central, “flat” repository of all raw structured and unstructured data in object form, the data lake was designed to make data more accessible to more employees without the risk of siloing. Data lakes tend to run cheaper than warehouses since most public clouds support the object storage model.
But many organizations, especially those at the leading edge of data storage and analysis, started to notice problems with both data warehouses and lakes, even after trying to offset the weaknesses of each by combining them into a single management and analysis infrastructure.
Back in 2014, Uber was struggling with their data warehouse, according to Vinoth Chandar, who managed the company’s data team at the time. They realized that different business units had different “versions” of the company’s data. Some analyses included the most recent updates, while others didn’t, which meant their people made critical decisions based on false or outdated assumptions.
Uber’s engineers started building a custom Hadoop infrastructure around their warehouse, effectively combining their data warehouse with a data lake, to help different teams run analytics and make decisions based on the data they were paying handsomely to collect and store. Internally, they called this project “Hoodie.”
In parallel with Uber, developers from Netflix, Apple, and Salesforce started working on a different open-source framework for democratizing the enormous volume of data they were all collecting about their customers. With both warehouses and lakes, these companies often needed to copy data to other systems to help their employees run analytics in comfortable, ACID-compliant environments where they didn’t have to worry about affecting durability. They were being overrun with complexity.
They started building what’s now called Iceberg, an open-source format for big data analytics that lets multiple engines work on the same tables, at the same time, with the “reliability and simplicity of SQL tables.”
Developers behind both projects eventually released them as open source, following a trend long established among Silicon Valley tech giants. Back in 2011, Yahoo spun Hadoop out into its own company, and in 2014, LinkedIn did the same with Kafka. Both Hoodie (now called Hudi) and Iceberg are part of the Apache Software Foundation, where they’re maintained and built by a global network of volunteer contributors.
Hudi is now supported on AWS, Google Cloud, and Microsoft Azure and is used by companies like Disney, Twitter, Walmart, and more.
Hudi and Iceberg are also now the foundation of the data lakehouse industry. When deployed into production against new or existing data sets, these tools let organizations store all their structured and unstructured data on low-cost storage, just like data lakes do. They also bring in the data structure and management features of warehouses, like ACID-compliant transactions and simpler query development.
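To make the idea of warehouse-style transactions on cheap file storage a little more concrete, here is a toy sketch in Python. It is not how Hudi, Iceberg, or Delta Lake are actually implemented. The `ToyTable` class, the `snapshot.json` file, and the `demo` function are all hypothetical names invented for illustration. The sketch assumes a local directory stands in for object storage, and it ignores concurrent writers entirely. What it shows is one general pattern: data files are written once and never modified, and a write only becomes visible when a small metadata file listing the current data files is swapped in atomically.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Hypothetical toy table: immutable data files plus one snapshot file."""

    def __init__(self, root):
        self.root = root
        self.snapshot_path = os.path.join(root, "snapshot.json")
        os.makedirs(root, exist_ok=True)
        if not os.path.exists(self.snapshot_path):
            self._publish([])

    def _publish(self, files):
        # Write the new snapshot to a temporary file, then swap it into
        # place. os.replace is atomic on a local filesystem, so readers
        # see either the old file list or the new one, never a mix.
        tmp = f"{self.snapshot_path}.{uuid.uuid4().hex}.tmp"
        with open(tmp, "w") as f:
            json.dump({"files": files}, f)
        os.replace(tmp, self.snapshot_path)

    def _current_files(self):
        with open(self.snapshot_path) as f:
            return json.load(f)["files"]

    def append(self, rows):
        # Step 1: write the rows to a brand-new data file (not yet visible).
        name = f"data-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, name), "w") as f:
            json.dump(rows, f)
        # Step 2: commit by publishing a snapshot that includes the file.
        self._publish(self._current_files() + [name])

    def scan(self):
        # Readers only look at files named in the current snapshot.
        rows = []
        for name in self._current_files():
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.load(f))
        return rows


def demo():
    with tempfile.TemporaryDirectory() as root:
        table = ToyTable(root)
        table.append([{"trip_id": 1, "fare": 12.5}])
        table.append([{"trip_id": 2, "fare": 7.0}])
        return table.scan()
```

Calling `demo()` returns the two rows in commit order. A data file that was written but never published in a snapshot would simply be ignored by `scan`, which is the toy version of a failed write leaving the table untouched.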
By combining the benefits of warehouses and lakes, the lakehouse lets organizations utilize their massive quantities of unstructured data with the speed and reliability of a warehouse. That’s a new foundation for data democratization—an organization’s entire workforce, from developers to marketers to salespeople, running business and machine learning (ML) analytics on large quantities of data stored in a single, stable place.
The lakehouse’s pitch is compelling, which is why the market is heating up fast. Back in February, Onehouse netted an $8 million seed round to build an open-source data lakehouse based on Hudi with a managed service in the offing. Earlier this year, Dremio raised $150 million in its Series E to extend its product, partially based on Iceberg. The company recently made the free edition of its cloud service generally available for enterprises. Databricks, which also maintains its own open-source Delta Lake architecture, claims more than 450 partners and multicloud support.
But, as with any lakehouse, there’s the hype cycle and price tag to account for, which likely locks out small and mid-sized companies for the time being. In the meantime, they’ll have to settle for a swim in the lake.