What is a Data Lakehouse? 

 A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses.

A data lakehouse might be the next step in data storage and processing, merging the strengths of data warehouse and data lake architectures into a single system built for the next decade of technological development.

When then-Pentaho CTO James Dixon first introduced the idea of the data lake, experts in the field were split: some saw real value in lakes as a fix for the shortcomings of standard data warehouse solutions, while others dismissed the term as marketing for a set of products built around the Hadoop ecosystem.

Some also took issue with the potential for data silos created by a data lake's ability to store and process all types of data, whether structured, semi-structured, or unstructured. That concern was warranted: an entire industry has sprung up over the last decade to accommodate the huge influx of unstructured data.

Data lakes have improved in value and sophistication over the past few years, which some consider a comeback for the architecture. Others see them as having evolved into something new: the data lakehouse, a term both Databricks and Snowflake claim to have coined.

See also: How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake

“The lakehouse is a new data management architecture that radically simplifies enterprise data infrastructure and accelerates innovation in an age when machine learning is poised to disrupt every industry,” said Ali Ghodsi, CEO of Databricks. “In the past most of the data that went into a company’s products or decision making was structured data from operational systems, whereas today, many products incorporate AI in the form of computer vision and speech models, text mining, and others. Why use a lakehouse instead of a data lake for AI? A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.”
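To make the versioning and ACID claims concrete, here is a minimal sketch using the open-source Delta Lake table format that underpins Databricks' lakehouse. The paths and sample data are illustrative assumptions, and it presumes a Spark session configured with the delta-spark package:

```python
# Minimal sketch: ACID writes and data versioning ("time travel")
# with open-source Delta Lake on PySpark. Paths and sample data are
# illustrative assumptions, not taken from the article.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each write is an atomic transaction recorded in the table's log.
events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["user_id", "action"]
)
events.write.format("delta").mode("overwrite").save("/data/events")

# Appending later data creates a new table version instead of
# mutating files in place.
more = spark.createDataFrame([(3, "purchase")], ["user_id", "action"])
more.write.format("delta").mode("append").save("/data/events")

# Time travel: read the table exactly as it existed at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/events")
v0.show()
```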

Databricks' overview of the topic illustrates how data lakehouse architecture embeds a metadata and governance layer on top of the data store. This means that data from a diverse set of sources can be processed and stored in a single, unified system, which improves accessibility for everyone in an organization.
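As a rough illustration of what that metadata layer provides, the sketch below continues the Delta Lake example above, with the same illustrative path, by inspecting the table's commit history and showing schema enforcement rejecting a malformed write:

```python
# Illustrative sketch: in a Delta-based lakehouse, the metadata layer
# is a transaction log recording every commit, schema, and data file.
# Assumes the Spark session and table from the previous sketch.
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/data/events")

# The log exposes an auditable history of commits: which operation
# ran, and when, supporting governance over the same files a plain
# data lake would hold.
table.history().select("version", "timestamp", "operation").show()

# Schema enforcement comes from the same metadata layer: a write with
# a mismatched schema is rejected rather than silently stored.
bad = spark.createDataFrame([("oops",)], ["wrong_column"])
try:
    bad.write.format("delta").mode("append").save("/data/events")
except Exception as err:  # AnalysisException: schema mismatch
    print("Rejected by schema enforcement:", type(err).__name__)
```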

Accessibility is important, as it is one of the key shortcomings of previous-generation data storage and processing solutions. With a data lakehouse, different departments in an organization can access datasets without having to go through the engineering department, which can improve productivity and enable deeper analysis of the data.

Another benefit of the data lakehouse is additional security, as organizations can limit access to data without worrying that extra copies have been made elsewhere. This level of control, down to the column or row level, is very difficult to achieve once data is offloaded to a data warehouse or stored in multiple places.
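One common way to express that kind of control without copying data is a governed view over the source table. The sketch below is hypothetical: the table, view, and group names are invented, and it assumes a governance layer (such as Unity Catalog or a similar SQL-permission engine) that actually enforces GRANT statements:

```python
# Hypothetical sketch of row- and column-level control via a governed
# view over a single lakehouse table, so no extra copies are made.
# Names are illustrative; fine-grained permission syntax varies by
# platform, and open-source Spark alone does not enforce GRANTs.
spark.sql("""
    CREATE OR REPLACE VIEW sales_eu_masked AS
    SELECT
        order_id,
        region,
        amount,
        sha2(customer_email, 256) AS customer_email  -- column masked
    FROM sales
    WHERE region = 'EU'                              -- rows filtered
""")

# Downstream users are granted the view, never the underlying table,
# so access is constrained at the source rather than through copies.
spark.sql("GRANT SELECT ON VIEW sales_eu_masked TO `analysts`")
```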

With an open, unified system, organizations can also connect third-party analytics, visualization, and other tools directly to the data source, enabling businesses to see analysis and visualization as close to real time as possible.
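For instance, any engine that speaks the open table format can read the same files Spark wrote, with no export step in between. Here is a small sketch using the open-source deltalake (delta-rs) Python package, reusing the illustrative path from the earlier sketches:

```python
# Sketch: a third-party tool reading a lakehouse table directly,
# without an export or warehouse-load step. Uses the open-source
# `deltalake` (delta-rs) package; the path is an assumption carried
# over from the earlier sketches.
from deltalake import DeltaTable

# The same files Spark wrote are readable by any engine that speaks
# the open format, so dashboards and notebooks stay near real time.
df = DeltaTable("/data/events").to_pandas()
print(df.head())
```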
