Unifying the Data Warehouse and Data Lake Creates a New Analytical Rhythm

Unifying analytics is a drum I’ve been beating for a while now, and the flow of evolving analytics requirements and industry disruptions is beating it even louder. The data warehouse and the data lake are merging into one architecture, whether it’s called a unified analytics warehouse (UAW), a data lake house, or an Integrated Data Analytics Platform. The data management and analytics industry may not agree what to call it, but we agree that a single platform for all analytics – descriptive, diagnostic, predictive or prescriptive – will become more and more the norm over time. Business intelligence and data science are the right and left hand, and companies need both to keep a good rhythm.

One thing my company, Vertica, and I have tried to do is pull the focus away from “where data will be stored,” which is the subject of a lot of the talk about this architectural change, to “how we do analytics.” Based on a smart analogy my boss, Joy King, used, I wrote an article here on RTInsights about delivering analytics the way Amazon delivers packages, with the focus on delivering analytics that meets the needs of data consumers, not focusing on where the data (or the package) is stored. 

Some folks are still asking me, what’s the big deal? Why are we re-designing how we do analytics? What could possibly be so amazing it’s worth changing everything? Well, let’s start with the baseline beats.

Advantages of a Data Warehouse

Like the drummer is the anchor of a band, the beating heart of a data warehouse architecture is a powerful analytical database that powers business intelligence with ANSII standard SQL plus a fair number of analytical functions. That database should have these strengths:

  • High Performance
  • High Concurrency
  • Reliability
  • Security
  • Governance

Advantages of a Data Lake

The baseline pulse of a data lake is flexible scale-out data storage that powers data science with a variety of frameworks data scientists are accustomed to like Python, R, or notebooks like Jupyter. That should have these strengths:

  • Virtually unlimited scale
  • Flexibility of deployment – commodity hardware, cloud, hybrid, multi-cloud
  • Specialized analytics – geospatial, time series, text, etc.
  • Wide range of data format support to take advantage of strengths of each:
    1. Streaming data – real-time monitoring
    2. Semi-structured (JSON, Avro, CSV) – schema-on-read, flexibility for schema drift
    3. Hierarchical columnar (Parquet, ORC) – complex data types (maps, structs, arrays) – efficient long-term storage without losing query capability

Advantages of a Unified Analytics Warehouse

A unified analytics warehouse combines these strengths to make something even better than the sum of its parts, like a drum set is better than individual drums:

  • SQL query and business intelligence on full range of data – Streaming, semi-structured, hierarchical, and complex – including joining different kinds of data when needed to answer a question
  • High performance at virtually unlimited scale, even with many concurrent users or jobs, for both BI and data science
  • Low-latency response from streaming data and machine learning to meet even sub-second SLA’s, without costing an arm and a leg
  • On-prem deployments with a cloud backup for disaster recovery that can be up in minutes, or cloud deployments for a single company in different clouds for different regions
  • Data science with R or Python that is distributed for performance at scale, put into production as soon as it’s proven, with models managed as if they were tables
  • Data life cycle management – 
    1. hot data in a structured format optimized for high performance, cold data pushed to a more efficient long-term storage format that is still queryable, 
    2. ability to shift data from format to format easily when it makes sense, or query it in place if analyzing the format it’s in will meet SLA’s
  • Security, governance, reliability, even at extremely high scale
  • Ability to separate or isolate workloads to avoid resource conflicts, so training an ML model won’t slow down a dashboard drill-down, or an ad hoc query won’t keep a real-time model from reacting fast enough

The advantages are clear as a bell, and too numerous to put in just one article. Check out some of the EMA papers on the subject of the unified analytics warehouse if you want to get a better handle on it. In ten years, people are going to wonder how anyone got along with just a data lake or just a data warehouse. The combination is creating something unique and powerful as the heartbeat drum in your favorite song, and it’s revolutionizing high scale analytics across industries and across the world.

Leave a Reply

Your email address will not be published. Required fields are marked *