Getting File Data Ready for Cloud Data Lakes

If 2020 and 2021 were the years of rapid cloud acceleration, 2022 will be the year when enterprises begin to get serious about bringing unstructured file data into cloud data lakes. There are a few reasons behind this trend. First, organizations are sitting on petabytes of unstructured data, which comprises at least 80% of the 64 zettabytes of data (and growing) stored worldwide today. Most of this is file data, from medical images to streaming video, sensor data from electric cars and IoT products, and the documents people use in every sector to collaborate and do business.

Second, file data is becoming unmanageable and costly to store, and CIOs know they are sitting on a potential gold mine of insights, if only they could determine how to get it into the right places for analysis. Finally, the major cloud platforms are investing heavily in data analytics/ML/AI tools and lower-cost object storage tiers to support data lake projects.

The maturing of data lakes in the cloud

Enabling data lakes is one of the top goals that IT managers are prioritizing, along with security, cost management, and visibility, according to a recent study we carried out. The cloud has upended traditional data lake strategies, which began when companies wanted to analyze semi-structured data such as CSV and log files. In 2006, Hadoop was born, gaining widespread adoption just as Big Data conversations were beginning to circulate. Yet Hadoop eventually proved slower and more expensive than expected, complicated to set up, scale, and manage, and primarily designed for batch processing. To solve these issues, Apache Spark entered the scene, running up to 100x faster for some workloads and well suited for real-time analysis. Importantly, the focus of companies like Databricks was to run Spark in the cloud, whereas Hadoop was primarily implemented on-premises.

In the past few years, cloud-based data lake platforms have matured and are now ready for prime time. Cloud providers’ cheaper scale-out object storage delivers a platform for massive, petabyte-scale projects that simply isn’t viable on-premises. Next-generation data lakes are built on Apache Spark to support S3 and other object storage, making it possible to ingest and process semi-structured and unstructured data. File storage is also transitioning to the cloud and needs to be leveraged as part of a cloud data lake, so not all data may be in object storage.
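
To make this concrete, below is a minimal PySpark sketch of what that ingestion can look like: semi-structured JSON logs and unstructured binary files read straight from object storage into the lake. The bucket and path names are illustrative, and the s3a:// paths assume the Hadoop AWS connector is configured; this is a sketch under those assumptions, not a prescribed pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-data-ingest").getOrCreate()

# Semi-structured data (JSON event logs) read directly in its native format
logs = spark.read.json("s3a://example-lake/raw/logs/")

# Unstructured files (e.g., images) loaded as raw bytes plus path, size, and
# modification-time metadata via Spark's binaryFile source (Spark 3.0+)
images = spark.read.format("binaryFile").load("s3a://example-lake/raw/images/")

# Persist a curated, columnar copy alongside the raw data for analytics
logs.write.mode("overwrite").parquet("s3a://example-lake/curated/logs/")
```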

A cloud data lake strategy is a natural evolution for data-heavy enterprise IT organizations moving to the cloud, as it elevates the cloud from a cheap data storage locker to a place where data can be leveraged for new value, and monetized.

How to tame the cloud data lake

While these are still early days for cloud data lakes, including file data in your data lake is imperative, as machine learning models require large quantities of it to generate meaningful results. Yet this unstructured data isn’t standardized across file types: video files, audio files, sensor data, and logs don’t share a common structure. And dumping all this file data willy-nilly into the cloud data lake platform is not a sound strategy but a mess to clean up later. Despite their promise, data lakes carry many risks, ranging from high management costs, skill gaps, and security and governance concerns to portability issues when moving data between clouds and storage platforms, and the long-standing worry of the data lake becoming a swamp when data gets too big and tangled to search and analyze.

Here are some considerations for bringing file data into a cloud data lake that can help avoid or minimize the strife.

  1. Optimize the data lake. Before any data can be analyzed, it must be cleansed, normalized, and classified, which can be a highly manual process contributing to cost overruns and slow time to value. This has always been a challenge for data warehouse initiatives, and the same applies to data lakes and data lakehouses. Data lakes are appealing because they can ingest data in its native format; requiring optimization before putting data into the lake destroys this ease of use. How can you automatically optimize file data without requiring a change to user behavior? The key to optimizing file data is the metadata: the information on file types, dates created and last accessed, owners, projects, and location. The ability to automatically index and tag files on metadata properties will avoid data swamp issues and make data easier to search and segment later, as opposed to leaving the data lake unmanaged.
  2. Use metadata indexing to find precise data sets for specific needs. Tools that can index files and search metadata across storage (including on-premises, edge, and cloud locations) can narrow billions of files down to a few thousand, so that you send only the precise files you want to analyze to the cloud (a minimal indexing sketch follows this list).
  3. Tag data as you go for improved searchability and usability. Once you find the files you need, you can use a machine learning system to further refine the search with more tags. This process must be continuous and automated, so that over time your data lake gains additional structure, easier searchability, and higher overall quality.
  4. Accommodate the edge. As edge computing grows due to new use cases around sensor data, streaming all raw data from the edge to the cloud is going to become untenable. How can you process more data at the edge and take just what you need into a cloud data lake? Edge pre-processing will become more critical as edge data volumes grow (see the pre-processing sketch after this list).
  5. Create taxonomies by industry. There is no standard tagging nomenclature for each industry. Having some common tagging classifications by sector will make data easier to search and extract, especially in collaborative environments such as research and life sciences.
  6. Address data mobility. To be truly mobile, data should be able to reside in different systems across hybrid cloud environments while natively accessing the services in those environments. Unlocking data from proprietary storage systems gives control back to IT and eliminates the fees and hassles of moving data from one platform to the next. The way data is used and accessed, and its value, changes over time. By future-proofing your data, you can adapt to change and new requirements. Independent data mobility and management solutions can help here.
  7. Build the right culture. Leading IT organizations continue to identify culture – people, process, organization, change management – as the biggest impediment to becoming data-driven organizations, according to 2021 research by New Vantage Partners. A data-driven culture needs to span not just the analysts and the lines of business but IT infrastructure teams. IT leaders will need to play a role in helping data storage, server and networking professionals re-orient their responsibilities and daily tasks toward a data-centric decision-making framework. Tools and processes should be cross-functional, allowing for a holistic view of the organization’s data assets and collaboration around strategies for managing those assets for organizational gain.
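
As a concrete illustration of points 1 through 3, here is a minimal Python sketch (standard library only) that builds a metadata index over a file share, tags records from metadata properties, and filters the index down to the precise set of files worth sending to the cloud. The mount point, extensions, tag names, and one-year threshold are hypothetical examples rather than a prescribed taxonomy; a production tool would index billions of files across on-premises, edge, and cloud storage.

```python
import os
import time
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    path: str
    size_bytes: int
    ext: str
    last_access: float
    tags: set = field(default_factory=set)

def build_index(root: str) -> list[FileRecord]:
    """Index basic metadata for every file under root."""
    index = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            index.append(FileRecord(
                path=full,
                size_bytes=st.st_size,
                ext=os.path.splitext(name)[1].lower(),
                last_access=st.st_atime,
            ))
    return index

def tag_and_filter(index: list[FileRecord]) -> list[FileRecord]:
    """Tag records from metadata, then return only the files worth sending."""
    one_year_ago = time.time() - 365 * 24 * 3600
    selected = []
    for rec in index:
        if rec.ext in {".dcm", ".nii"}:
            rec.tags.add("medical-imaging")   # hypothetical taxonomy tag
        if rec.last_access < one_year_ago:
            rec.tags.add("cold")              # candidate for an archive tier
        if "medical-imaging" in rec.tags:
            selected.append(rec)              # the precise set to analyze
    return selected

if __name__ == "__main__":
    idx = build_index("/mnt/file-share")      # hypothetical mount point
    for rec in tag_and_filter(idx):
        print(rec.path, sorted(rec.tags))
```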
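
And for point 4, a sketch of edge pre-processing under the same caveats: the edge node reduces raw sensor readings to one compact summary per sensor and keeps full detail only for anomalous readings, so only a fraction of the raw volume ever travels into the cloud data lake. The threshold and field names are illustrative assumptions.

```python
import json
import statistics
from collections import defaultdict

ANOMALY_THRESHOLD = 90.0  # e.g., temperature in Celsius; hypothetical value

def preprocess(readings):
    """Reduce raw readings to per-sensor summaries plus anomalies."""
    by_sensor = defaultdict(list)
    anomalies = []
    for r in readings:
        by_sensor[r["sensor_id"]].append(r["value"])
        if r["value"] > ANOMALY_THRESHOLD:
            anomalies.append(r)  # keep full detail only for outliers
    summaries = [
        {"sensor_id": sid, "count": len(vals),
         "mean": statistics.fmean(vals), "max": max(vals)}
        for sid, vals in by_sensor.items()
    ]
    return summaries, anomalies

if __name__ == "__main__":
    raw = [{"sensor_id": "pack-7", "value": v} for v in (71.2, 72.0, 95.4)]
    summaries, anomalies = preprocess(raw)
    # Only these compact records would be uploaded to the data lake
    print(json.dumps({"summaries": summaries, "anomalies": anomalies}, indent=2))
```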

Cloud data lakes have gained popularity because data can be ingested in its native format without the extensive pre-processing needed for data warehouses. The flip side is that data lakes have become data swamps, particularly for unstructured file data, which has no common structure. Analyzing file data is becoming more critical with the rise of AI/ML engines, which rely on it. Cloud data lakes can be optimized for unstructured data, without destroying the appeal of ingesting data in native format, by automating the indexing, search, collection, and optimization of file data.
