Right now, the world of analytical data engineering and data architecture is awash with confusion and controversy about how we should handle data for analytics. A lot of the controversy centers on which is better: a data warehouse architecture, a data lake architecture, or some combination of the two.
I think we’re all asking the wrong question.
Analytics end users don’t care where or how data is stored.
Executives, business analysts, data scientists, even line-of-business workers – they’re interested in the analytics, but not as much in the data.
As data engineers and architects, we have to care where the data is stored, how it gets there, and how it gets cleaned and managed to feed into analytics. We have to worry about real-time pipelines, historical storage, and combining incoming time-series sensor data with geographic data about the weather at that timestamp.
However, the people driving the business, the people who write our paychecks, don’t care. As a profession, we need to stop forcing them to worry about where their data resides.
Unify the analytics, not the data
Once upon a time, data warehouse architectures were designed to gather data, combine it, polish it, and present it to visualization tools that showed everyone how the business was performing. Business analysts put in SQL queries as needed.
Then, along came Doug Laney’s three V’s – massive increases in data volume, velocity, and variety, including streaming real-time data from devices that, in many ways, encompassed all three. Also, new people called data scientists, who looked a lot like our old quants, statisticians, and actuaries, needed all that data to do sophisticated predictive analytics and machine learning.
The data lake was touted as the solution. Dump everything here and do analytics on top of that crazy mess. It’ll be great.
But it wasn’t so great.
Governance wasn’t there; security wasn’t there. Most importantly to end users, concurrency and response times weren’t there. The architecture could no longer support all the people who wanted to perform analytics, much less expand to allow more people in the company to use data to drive their decisions. Nor could it provide analytical answers at the speed they wanted to ask the questions, much less the speed of automation.
A data scientist training a model might cause the system to crash or bog down, causing a business analyst to miss their SLA, reports to be generated only after the CEO went into a stockholder meeting, or an automated threshold to fail to trigger and shut off a valve before it spewed. Putting everything in one giant lake meant everyone competed for the same resources.
The brave new world poses a false choice
Now, some of the folks who failed to deliver on the promises of the data lake are trying to convince us that they’re going to add a few features from the old data warehouse, call it some silly new name, and ta-da, all is solved.
We’ve got data architects struggling to find workarounds. They’re building complex combination architectures that include both a data warehouse and a data lake. Then, depending on who the end user is, they tell them where to find their data, what condition it’s in, force them to fetch data from several different locations in different formats, and let the consumers figure out how to analyze it.
Folks are arguing about whether they should do everything streaming-first, whether they should use only open source software, only proprietary software, or only one vendor’s proprietary software. Should they put all the data in the cloud?
They’re missing the point.
Depending on who you ask, data scientists end up spending 60 to 90 percent of their time combining and cleaning data to get it ready for analytics – which is something data engineers and data architects get paid to do.
What’s more, business analysts and dashboard consumers really don’t care if the architecture is built all on the cloud, or if it’s from a proudly open-source shop. They genuinely don’t care where you put their data any more than an Amazon shopper cares what warehouse the retail giant stored their product in.
Would you like to drive to a particular Amazon warehouse, find your product, put it in a box, and drive it home yourself? Similarly, analytics consumers don’t want you to just tell them where the data is and wish them luck.
So, what do analytics consumers really need?
- Ease of Use – How hard is it going to be to get the analysis I need?
- Accuracy – Can I trust that the analysis will be accurate?
- Workload Isolation – Can I ask the analytical questions I need to ask without crashing the system or slowing down my boss’s dashboard?
- Concurrency – When I need access to analytics, can I get it, or will I have to wait in line?
- Response Speed – Am I going to get an analytical answer back fast enough to matter?
In other words, they care about the analytics.
How can data engineers and architects deliver better analytics?
Stop focusing on unifying the data storage and focus on unifying the analytics experience. You might think, “But processing and storing data is what a data engineer does.” That’s like saying, “Moving boxes is what an Amazon delivery person does.”
An Amazon delivery person needs to focus on making sure the right package is delivered to the right address within the stated delivery window. They need to know things like the storage location, the packaging process, and the best transportation route, but that’s not the focus.
The people designing and building data architectures should not be focused on where and how to transport, store, and process data. They have to be focused on how to serve analytics.
Architects need to work backward. Look first at what the analytics consumer needs. Analytics consumer requirements, and keeping costs reasonable, are the primary concerns.
This throws a lot of things out the window that you might have thought were important.
Open source or proprietary? Doesn’t matter. Choose what will do the job best and keep costs reasonable, not just software costs, but also maintenance, support, and operational costs.
Cloud, on-premises, hybrid, or something else? Doesn’t matter. Choose what will do the job best now, and expect it to change over time, so also prepare for the future.
Data warehouse, data lake, combination, or something else? Doesn’t matter. Choose architecture based on making analytics accessible, not on where data is stored, and be wary of vendors who insist that you must store all your data in their platform.
Focusing on analytic consumer needs sounds simple, but it’s a lot easier said than done.
What makes a solid data architecture?
Working backwards, start with the analytical consumer needs:
1) Ease of use
To an executive or line-of-business person, ease of use means a dashboard or report that shows them what they need to know quickly and understandably. It also means that if they click on that visualization dashboard to drill down on a particular region or important fact, they get back a new visualization quickly. From a data engineering perspective, that means solid integration with visualization software, a data querying engine that robustly supports full ANSI SQL (since those visualization tools send some gnarly SQL queries), and fast performance.
To a business analyst, ease of use means generating reports and building dashboards quickly and easily, without worrying about where the data they need is stored. It means sending an ad-hoc SQL query to get answers to questions they were just asked, without having to go back to an ETL or data engineering team and ask them to add a column of data that they left out before. And it means SQL again, the business analyst’s language of choice.
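To make “gnarly” concrete, here’s a minimal sketch of the kind of drill-down query a dashboard click or an ad-hoc question can turn into: a nested aggregation plus a window function. The sales table and its columns are invented, and SQLite simply stands in for whichever engine actually serves the analytics; the point is that the engine has to handle this sort of ANSI SQL routinely.

```python
import sqlite3

# Hypothetical sales data, purely for illustration; sqlite3 stands in for
# whatever query engine actually sits behind the dashboards.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('West', 'widget', '2024-01-05', 120.0),
        ('West', 'gadget', '2024-01-06', 300.0),
        ('East', 'widget', '2024-01-05',  95.0),
        ('East', 'gadget', '2024-01-07', 410.0);
""")

# The sort of drill-down a BI tool generates: aggregate in a subquery,
# then rank products within the selected region with a window function.
query = """
    SELECT region,
           product,
           revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
    FROM (
        SELECT region, product, SUM(amount) AS revenue
        FROM sales
        WHERE region = ?
        GROUP BY region, product
    )
    ORDER BY revenue_rank;
"""
for row in conn.execute(query, ("West",)):
    print(row)
```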
To a data scientist, ease of use means using familiar tools like Python, R, or a notebook like Jupyter. Since SQL is needed by other users, the flexibility to use different tools to access data is a key aspect of good architecture. Ease of use also means addressing the entire end-to-end data science workflow in one place without moving chunks of data somewhere else. It means quick, easy, complex data preparation operations like geo-fencing, or disparate time series data joins, or missing value interpolation. Training models should happen on a distributed system for speed and accuracy, without moving data or re-doing work. This includes not having the data engineering team re-do their work in a different framework to operationalize. The environment they develop on should be virtually identical to production to make that essential final jump to production as easy as possible. And it would be nice if they could manage model life cycles as well, without moving either the model or the data that trained it.
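To make the data preparation side concrete, here’s a minimal sketch in pandas of two of the operations mentioned above: joining disparate time series on nearest timestamps and interpolating a missing value. The sensor and weather data are invented, and at production scale this work would run in the analytics platform itself rather than on a laptop, but the shape of the work is the same.

```python
import pandas as pd

# Invented example data: sensor readings and weather observations arriving
# on different, irregular timestamps.
sensors = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:05",
                          "2024-01-01 00:10", "2024-01-01 00:15"]),
    "pressure": [101.2, None, 101.9, 102.4],   # one missing reading
}).sort_values("ts")

weather = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:02", "2024-01-01 00:12"]),
    "temp_c": [3.1, 2.7],
}).sort_values("ts")

# Fill the missing sensor value with time-based interpolation.
sensors["pressure"] = (
    sensors.set_index("ts")["pressure"].interpolate(method="time").values
)

# Join the two disparate time series on the nearest earlier weather reading.
combined = pd.merge_asof(sensors, weather, on="ts", direction="backward")
print(combined)
```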
2) Accuracy
For the data scientist and the business analyst, who are building analytics, accuracy means knowing where to find the right data. But you don’t want to point them at a giant warehouse or lake, and say, “Go fishing.” They’ll need a specific inventory, so they know where to find exactly what they need.
For all analytics consumers, accurate analysis requires knowing where the data came from and knowing that the data is clean and verified as fit for use. Accuracy often comes down to data quality, data lineage, and data governance. If you thought you could let those slide in the age of big data, I’ve got some bad news. Clean, known data from the right source is just as important now as ever.
Accuracy also depends on the business analyst and the data scientist building good analytics. You might think that isn’t the data engineer’s problem. But, to some extent, it is.
For business analysts, a big part of building accurate analyses is having a complete picture from all relevant data sets. Having some or most of the data, with the rest off in a silo somewhere, won’t give them a complete picture of the organization. Now, this may sound like the old story – move all the data to one place first. But providing access to all the relevant data sets doesn’t necessarily mean moving all the data to one place. Storage location doesn’t matter, but analytic access does.
For data scientists, a big part of building accurate analyses is having complete data sets. Machine learning requires a lot of data for training. More data even beats a better algorithm for increasing accuracy. Provide data scientists with access to the entire data set, no matter how big. Taking a small sub-sample that can fit in memory on a laptop, and building a model from that, is a recipe for reduced model accuracy, not to mention re-doing work. Focus on building an architecture where the only reason data scientists need to sample data is to separate out training and verification samples.
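As a sketch of what that looks like in practice, assuming a Spark environment and a hypothetical Parquet data set in shared storage, the split happens where the full data lives, and the split itself is the only sampling:

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming a PySpark environment and a hypothetical Parquet
# data set in shared storage. The point is to split the *entire* data set
# into training and verification samples where it lives, rather than
# sub-sampling whatever fits in a laptop's memory.
spark = SparkSession.builder.appName("train-verify-split").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/events/")  # hypothetical path

# The only sampling here is the train/verification split itself.
train, verify = events.randomSplit([0.8, 0.2], seed=42)

print(train.count(), verify.count())
```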
3) Workload isolation
This is a subject that hasn’t gotten the attention it deserves, especially when many big data vendors want you to shove all your data in one place first and foremost.
Business intelligence teams are the essential heart of many data-driven organizations, building reports for everyone to use, and answering questions as they come up with ad-hoc queries.
Data science teams need access to the same data, and in bursts, huge amounts of compute power to train models.
Executives and line-of-business people want to drill down on dashboards and get fast responses.
And what about data engineering? When and where are the data transformation jobs going to run if three other teams need those same resources?
If every group uses the same compute resources on the same data, there are going to be some obvious conflicts. Isolating workloads from each other by providing dedicated compute resources and separate access to data can make your business a lot more harmonious. It can also provide each team with what it needs to do the job right.
The first thought most people have on this is making copies of the data for each team. That’s how data marts proliferated back in the day. The data inconsistencies and the constant need to update multiple locations make that less than ideal. Don’t make yourself crazy trying to build that spiderweb of pipelines.
These days, we have a better option – cheap shared storage in HDFS, S3, etc. Cloud computing has the concept of spinning up sub-clusters. Whatever data is needed is copied from communal storage, and whatever compute is needed for that particular job or team is ephemerally assigned just to them and no one else. The beauty of the sub-cluster concept is that it isn’t just a cloud thing. HDFS- or S3-style shared storage options are available on-premises as well.
Sub-clustering to isolate workloads makes sense and, contrary to what you might think, doesn’t tie you to one deployment option. There are other ways to isolate workloads, but sub-clustering is a really good one.
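As a rough illustration of the idea, assuming Spark on a cluster manager and a hypothetical shared-storage path, each team’s job gets its own ephemeral, right-sized compute against the same communal data, and that compute goes away when the job ends:

```python
from pyspark.sql import SparkSession

# Minimal sketch of the sub-cluster idea, assuming Spark on a cluster manager
# and a hypothetical shared-storage path. The BI team's job gets its own
# modest, ephemeral compute; a data science team would launch a separate,
# larger job against the same communal storage.
bi_session = (
    SparkSession.builder
    .appName("bi-team-reporting")
    .config("spark.executor.instances", "4")   # modest compute for reporting
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

orders = bi_session.read.parquet("s3a://shared-data/orders/")  # hypothetical path
orders.groupBy("region").count().show()

bi_session.stop()  # the compute goes away; the shared data stays put
```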
4) Concurrency
Concurrency is pretty straightforward. If the goal is for more aspects of a business to be data-driven, you have to provide access to data analytics to more people. Make sure your data analytics architecture can support everyone who can benefit from it. Don’t think you’re doing the organization any favors by using a cheaper option if it unreasonably limits the number of people who can use it.
5) Response speed
From an architecture perspective, analytical response speed comes down to the performance of whatever engine you’re using to do the analysis. Concurrency matters, too, though. Some analytic technologies have great response speed until more than ten people use them at the same time; then performance drops like a rock.
And, of course, your SLA matters. For some situations, getting an answer back in an hour is great. For others, three seconds is too long.
You may need to do things like train a model in one place on a large historical data set, then deploy it out to the edge where it can detect a pattern, and react to it in sub-second time frames. The flexibility to meet various demands is a big concept to keep in mind.
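Here’s a minimal sketch of that train-centrally, score-at-the-edge pattern, using scikit-learn and synthetic data purely for illustration; in practice the model would be trained on the full historical data set, and only the fitted artifact would be shipped to the device:

```python
import time
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the large historical data set; the labels and
# "valve" scenario are invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))            # e.g., four sensor channels
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)   # hypothetical "close the valve" label

# Train centrally, then ship only the fitted artifact to the edge device.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
joblib.dump(model, "valve_model.joblib")

# On the edge device: load once, then score each incoming reading.
edge_model = joblib.load("valve_model.joblib")
reading = rng.normal(size=(1, 4))

start = time.perf_counter()
decision = edge_model.predict(reading)[0]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"decision={decision}, scored in {elapsed_ms:.1f} ms")
```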
Unify the analytics, not the data
These days – with data from devices, data from transactional systems, data from external sources, structured data, unstructured data, complex hierarchical data – the data landscape is far too complex for moving all the data to one place to be practical. Instead of focusing on where the data lives, focus on making the analytics experience as smooth as possible for everyone in your organization.
Put those packages of analytics right on consumers’ doorsteps.