Getting File Data Ready for Cloud Data Lakes
If 2020 and 2021 were the years of rapid cloud acceleration, 2022 will be the year when enterprises begin to get serious about bringing unstructured file data into cloud data lakes. There are a few reasons behind this trend. First, organizations are sitting on petabytes of unstructured data, which makes up at least 80% of the 64 zettabytes of data (and growing) in storage worldwide today. Most of this is file data, from medical images and streaming video to sensor data from electric cars and IoT products and the documents people in every sector use to collaborate and do business.
Second, file data is becoming unmanageable and costly to store, and CIOs know they are sitting on a potential gold mine of insights, if only they could determine how to get the data into the right places for analysis. Finally, the major cloud platforms are investing heavily in data analytics, ML and AI tools and in lower-cost object storage tiers to support data lake projects.
The maturing of data lakes to the cloud
Enabling data lakes is one of the top goals that IT managers are prioritizing, along with security, cost management and visibility, according to a recent study we carried out. The cloud has upended traditional data lake strategies, which began when companies wanted to analyze semi-structured data such as CSV and log files. Hadoop was born in 2006 and gained widespread adoption just as Big Data conversations were beginning to circulate. Yet Hadoop eventually proved to be slower and more expensive than expected; it was complicated to set up, scale and manage, and it was designed primarily for batch processing. To address these issues, Apache Spark entered the scene, running up to 100x faster for some workloads and well suited to real-time analysis. Importantly, companies like Databricks focused on running Spark in the cloud, whereas Hadoop was primarily implemented on-premises.
In the past few years, cloud-based data lake platforms have matured and are now ready for prime time. Cloud providers’ cheaper scale-out object storage delivers a platform for massive, petabyte-scale projects that simply aren’t viable on-premises. Next-generation data lakes are built on Apache Spark to support S3 or other object storage, making it possible to ingest and process semi-structured and unstructured data. File storage is also transitioning to the cloud and needs to be leveraged as part of a cloud data lake, so not all data will be in object storage.
A cloud data lake strategy is a natural evolution for data-heavy enterprise IT organizations moving to the cloud, as it elevates the cloud from a cheap data storage locker to a place where data can be leveraged for new value, and monetized.
How to tame the cloud data lake
While these are still early days for cloud data lakes, including file data in your data lake is imperative, as machine learning models require large quantities of it to generate meaningful results. Yet this unstructured data isn’t standardized between file types: video files, audio files, sensor data and logs don’t share a common structure. Dumping all this file data willy-nilly into the cloud data lake platform is not a sage strategy; it only creates a mess to clean up later. Despite their promise, data lakes carry many risks: high management costs, skill gaps, security and governance concerns, portability issues when moving data between clouds and storage platforms, and the longstanding worry that the data lake becomes a swamp when the data gets too big and tangled to search and analyze.
Here are some considerations for bringing file data into a cloud data lake while avoiding or minimizing the strife.
Cloud data lakes have gained popularity because data can be ingested in its native format without the extensive pre-processing needed for data warehouses. The flip side is that data lakes have become data swamps, particularly for unstructured file data, because this data has no common structure. Analyzing file data is becoming more critical with the rise of AI/ML engines that rely on it. By automating the indexing, search, collection and optimization of file data, cloud data lakes can be optimized for unstructured data without losing the appeal of ingesting data in its native format.
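To make the idea of automated indexing and collection more concrete, here is a minimal sketch in Python, not a description of any particular product. It assumes the file data is reachable as a local or mounted directory tree. The names `FileRecord`, `build_index` and `select_for_lake`, and the selection rule (file extension plus a minimum age), are hypothetical and chosen only for illustration.

```python
import os
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileRecord:
    """Basic metadata for one file (hypothetical record layout)."""
    path: str
    extension: str
    size_bytes: int
    modified_epoch: float


def build_index(root):
    """Walk `root` and collect metadata for every file found beneath it."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            try:
                info = full.stat()
            except OSError:
                continue  # skip files that vanish or cannot be read
            records.append(
                FileRecord(
                    path=str(full),
                    extension=full.suffix.lower(),
                    size_bytes=info.st_size,
                    modified_epoch=info.st_mtime,
                )
            )
    return records


def select_for_lake(records, extensions, min_age_days, now=None):
    """Search the index for files of the given types that have not been
    modified for at least `min_age_days` (an assumed selection policy)."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    wanted = {ext.lower() for ext in extensions}
    return [
        r for r in records
        if r.extension in wanted and r.modified_epoch <= cutoff
    ]
```

Calling `build_index("/mnt/shares")` (a placeholder path) followed by `select_for_lake(index, [".dcm", ".csv"], min_age_days=90)` would return only the matching records, so a later copy step moves a curated subset into the lake rather than everything at once.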