The Long-Overdue Renewal of Data Storage for Accelerated Data Applications

A unified storage solution holds the key to a transformed data management experience. Discover why data storage is a mess and how to fix it.

We need to end the disaster of disparate storage in data engineering. We have grown so used to the complexities of multiple data sources and stores treated as isolated units that we consider them part of the process. Ask a data engineer what their usual day looks like, and you’ll get an elaborate spiel about overwhelming pipelines, with a significant chunk dedicated to integrating and maintaining multiple data sources.

Having a single isolated lake is far from the desired solution either. It eventually results in the same jumbled pipelines and swampy data – the two distinct nightmares of a data engineer. There is a dire need to rethink storage as a unified solution: declaratively interoperable, encompassing disparate sources in a single unit with embedded governance, and independent to the furthest degree.

The Expensive Challenges of Data Storage

Let’s agree, based on historical evidence, that disruptive transformations end up costing more time and resources than the theoretical plan suggests. And given that the data domain is mined with rapidly evolving innovations, yet another disruption with distant promises is not favourable for practical reasons.

So instead, let’s approach the problem with a product mindset to optimise storage evolution: what are the most expensive challenges of storage today, and how can they be pushed back non-disruptively?

  • ELT Bottlenecks: Access to data is very restrictive in prevalent data stacks, and a central engineering team has to step in as a mediator, even for minor tasks. The transformation necessary to make data useful creates multiple dependencies on central engineering teams, who become high-friction mediators between data producers and consumers. It is not uncommon to find engineers buried under open tickets, indefinitely stuck patching up countless faulty pipelines.
  • Isolation of Storage as a Standalone Unit: Storage has so far been treated only in silos. Data is optimised specifically for particular applications or engines, which severely limits its usefulness. Analytical engines the data is not optimised for suffer from gruelling speed and expensive queries. Isolated sources also mean writing complex pipelines and workflows to integrate several endpoints. Maintaining this mass consumes most of an engineer’s time, leaving next to no time for innovating on profitable projects and solutions.
  • Cognitive Overload and Poor Developer Experience: The consequence of the above two challenges is an extensive cognitive load that translates into sleepless nights for data engineers. They are not only restricted to plumbing jobs that deal with integrating countless point tools around data storage, but are also required to repeatedly maintain and relearn the shifting philosophies and patterns of multiple integrations.
  • Dark Data: The high-friction processes around data storage make accessing data at scale extremely challenging. Therefore, even if an organisation owns extremely rich data from valuable sources, it is unable to use that data to power the business. This is called dark data – data with plausible potential but no utilisation.

The Necessary Paradigm: Unification of Data Storage

Storage as a Unified Resource

If storage continues to be dealt with as scattered points in the data stack, then, given the rising complexity of pipelines and the growth spurt of data, the situation will escalate into broken, heavy, expensive, and inaccessible storage.

The most logical next step to resolve this is a Unified Storage paradigm. This means a single storage port that easily interoperates with other data management tools and capability layers.

Unified Access

Instead of disparate data sources to maintain and process, imagine the simplicity if there was just one point of management. No complication while querying, transforming, or analyzing because there’s only one standardized connection that needs to be established instead of multiple integrations, each with a different access pattern, philosophy, or optimization requirement.

  • Homogenous Access Plane: In prevalent storage systems, we deal with heterogeneous access ports with overwhelming pipelines, integration and access patterns, and optimization requirements. Data is almost always moved (expensive) to common native storage to enable fundamental capabilities such as discoverability, governance, quality assessment, and analytics. And the movement is not a one-time task, given the rate of change of data across the original sources.

    In contrast, Unified Storage brings in a homogenous access plane. Both native and external sources appear as part of one common storage – a single access point for querying, analysis, and transformation jobs. So even while the data sits at disparate sources, the user accesses it as if from one source, with no data movement or duplication.
  • Storage-Agnostic Analytics: In prevalent systems, analytical engines are storage-dependent and data needs to be optimized specifically for the engines or applications. Without optimization, queries slow down to a grueling pace, directly impacting analytical jobs and business decisions. Often, data is optimized for a single application or engine, which makes it challenging for other applications to consume that data.

    In contrast, with the unification of storage as a resource, data is optimized for consumption across engines (SQL, Spark, etc.), enabling analytical engines to become storage-agnostic. For example, DataOS, which uses Unified Storage as an architectural building block, has two query engines – Minerva (static query engine) and Themis (elastic query engine), designed for two distinct purposes, and multiple data processing engines, including Spark. However, all the engines are able to interact with the Unified Storage optimally without any need for specific optimization or special drivers and engines.
  • Universal Discoverability: In prevalent data stacks, data is only partially discoverable, within the bounds of each data source. Users are obliged to navigate to specific sources or move data to a central location for universal discovery.

    On the contrary, Unified Storage is able to present virtualized views of data across heterogeneous sources. Due to metadata extraction from each integrated source, native or external, users are able to design very specific queries and get results from across their data ecosystem. They can look for data products, tables, topics, views, and much more without moving or duplicating data.

Unification through Modularization

Most organizations, especially those at considerable scale, need to create multiple storage units, even natively, for separation of concerns – for different projects, locations, and so on. This means building and maintaining storage from scratch for every new requirement. Over time, these separate units evolve into divergent patterns and philosophies.

Instead, with Unified Storage, users create modular instances of storage with separately provisioned infrastructure, policies, and quality checks – all under the hood of one common storage with a single point of management. Even large organizations would have just one storage resource to master and maintain, severely cutting down the cognitive load. Each instance of storage requires no bottom-up build or maintenance – just a single spec input to spin it up over a pre-existing foundation.

Visual representation of provisioning instances of unified stream and lakehouse within one common storage. Fastbase and Icebase are implementations of the unified stream and unified lakehouse, respectively.

Unified Streaming

In existing storage constructs, streaming is often considered a separate channel and comes with its own infrastructure and tooling. There is no leeway to consume stream data through a single unified channel. The user is required to integrate disparate streams and optimize them for consumption separately. Each production and consumption point must adhere to the specific tooling requirements.

Unified streaming allows direct connection and automated collection from any stream, simplifies transformation logic through standardized formats, and easily integrates historical or batch data with streams for analysis. An implementation of unified storage, such as Fastbase, supports Apache Pulsar format for streaming workloads. Pulsar offers a “unified messaging model” that combines the best features of traditional messaging systems like RabbitMQ and pub-sub (publish-subscribe) event streaming platforms like Apache Kafka.

Unified Data Governance and Observability

We often see in existing data stacks how complex it is to govern data and data citizens on top of managing the data itself. This is because the current data stack is a giant web of countless tools and processes, and there is no way to propagate declarative governance and quality through such siloed systems. Let’s take a look at how a Unified Architecture approaches this.

Gateway communicates with the governance engine to get user tag information, the observability engine to get quality information, the infrastructure orchestrator to get cluster information, and Catalog for metadata.

  • Governance: In existing storage paradigms, Governance is undoubtedly an afterthought. It is rolled out per source and is restricted to native points. Additionally, there is an overload of credentials to manage for disparate data sources. Siloed storage is also not programmed to work with standard semantic tags for governance. Both masking and access policies for users or machines must be applied in silos.

    In contrast, Unified Storage comes with inherent governance, which is automatically embedded for data as soon as it’s composed into the unified storage resource, irrespective of native or external locations. Abstracted credential management allows users to “access once, access forever” until access is revoked. Universal semantic tags enforce policies across all data sources, native or external. Applying a tag at one point (say, in the catalog) propagates it across the entire data stack.
  • Observability & Quality: Much like governance, observability has also remained an afterthought, rolled out at the tail end of data management processes, most often as an optional capability. Moreover, observability too exists in silos, separately enabled for every source. Prevalent storage constructs have no embedded quality or hygiene checks, which is a major challenge when data is managed at scale. If something faults, it eats up resources over an unnecessarily complex route to root cause analysis (RCA).

    In contrast, unified storage is able to embed observability as an inherent feature, thanks to high interoperability and metadata exposure from the storage resources and downstream access points such as analytical engines. Storage is no longer a siloed organ but sits right in the middle of the core spine, aware of transmissions across the data stack. This awareness is key to speedy RCA and, therefore, a better experience for both data producers and consumers.

Enabling the Unified Storage Paradigm

Within the scope of this section, we’ll cover how we’ve established the unification of storage-as-a-resource through some of our native development initiatives.

Capabilities on top of Standard Lakehouse Format

Icebase is a simple and robust cloud-agnostic lakehouse, built to empower modern data development. It is manifested using the Iceberg table format atop Parquet files inside an object store like Azure Data Lake, Google Cloud Storage, or Amazon S3. It provides the necessary tooling and integrations to manage data and metadata simply and efficiently, as well as inherent interoperability with capabilities spread across layers inside the unified architecture of the data operating system.

While Icebase manages OLAP data, Fastbase supports the Apache Pulsar format for streaming workloads. As noted earlier, Pulsar offers a “unified messaging model” that combines the best features of traditional messaging systems like RabbitMQ and pub-sub (publish-subscribe) event streaming platforms like Apache Kafka.

Users can provision multiple instances of Icebase and Fastbase for their OLAP, streaming, and real-time data workloads. Let’s look at how these storage modules solve the pressing problems of data development that we discussed before.

Unified Access and Analytics

A developer needs to avoid the hassle of pushing credentials into source systems or raising tickets for access every time they start engineering a new asset. At the same time, robust and enforceable policies are necessary to prevent breaches in the security of both raw data and data products. Two resources within a data operating system – Depot and Policy – come together with Icebase to manage a healthy tradeoff between access and security.

To add to this, Minerva – the query engine of DataOS – brings together unified access and on-demand compute to enable both technical and non-technical users to achieve outcomes without worrying about the heterogeneity of their current data landscape, significantly reducing iterations with IT.

Access and Addressability

The unified storage architecture allows you to connect to and access data from managed and unmanaged object storage by abstracting the various protocols and complexities of the source systems into a common taxonomy and route. This abstraction is achieved with the help of “Depots”, which can be understood as registrations of data locations that make them systematically available to the wider data stack.

A depot assigns a unique address to every source system in a uniform format. This is known as the Universal Data Link (UDL), and it allows direct access to datasets without having to specify credentials again:

    dataos://[depot]:[collection]/[dataset]

In Pulsar, topics are the named channels for transmitting messages from producers to consumers. Accordingly, any Fastbase UDL (Universal Data Link) within DataOS is referenced as dataos://fastbase:<schema_name>/<dataset_name>, where the dataset is a Pulsar topic. Similarly, an example Icebase UDL looks like dataos://icebase:retail/city.

Depots have default access policies built in to ensure secure access for everyone in the organization. It is also possible to create custom access policies to access the depot and data policies for specific datasets within it. This is only for accessing the data, not moving or copying it.

Because the system is built entirely on open standards, interoperability can be extended to non-native components outside the operating system through standard APIs.

Connection to a depot opens up access to all the capability layers inside a unified data architecture, including governance, observability, and discoverability. The depot is managed by a Depot Service which provides:

  • A single-point connection with analytics, observability, and discoverability layers
  • Access to several native stacks, engines, and data apps. These can be considered as consumers of a depot.
  • Pluggability with secret management systems, whether DataOS-native or external. This helps you store credentials in the DataOS address itself. One can ask the stacks to use secrets with specific permissions, e.g.

    dataos://[depot]:[collection]/[dataset]?acl=r | for read access
    or
    dataos://[depot]:[collection]/[dataset]?acl=rw | for write access
  • Data Definition Language (DDL) and Data Manipulation Language (DML) interfaces that power creation and management capabilities (add/remove columns, partitions) on the data and metadata connected through the depot.

Analytics

Minerva is an interactive query engine based on Trino. It allows users to query data through a single platform across heterogeneous data sources without needing to know each database’s specific configuration, query language, or data format.

The unified data architecture enables users to create and tune multiple Minerva clusters to match their query workloads, which helps use compute resources efficiently and scale up or down flexibly as demand changes.

Say we want to find the customer(s) who placed the highest-quantity orders for any given product, and the discount given to them, where the data has to be fetched from two different data sources – an Azure MS-SQL database and a PostgreSQL database. First, we’ll create depots to access the two data sources and then a Minerva cluster. Once this is in place, we can seamlessly run queries across the databases and tune our Minerva cluster according to the workload if necessary.

Creating Depots
Write the following YAML files to connect two different data sources to the Unified Storage resource.

Microsoft Azure SQL Database Depot
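
The exact depot schema depends on the DataOS version in use; the sketch below shows the general shape a JDBC-style depot spec could take, with the depot name, host, port, database, and credential values as illustrative placeholders rather than authoritative field names.

    version: v1
    name: azuremssql                    # illustrative depot name
    type: depot
    layer: user
    depot:
      type: JDBC                        # Azure SQL is reachable over JDBC
      external: true
      spec:
        subprotocol: sqlserver
        host: <your-server>.database.windows.net
        port: 1433
        database: salesdb               # placeholder database name
      connectionSecret:                 # credentials are registered once, then abstracted away
        - acl: rw
          type: key-value-properties
          data:
            username: <username>
            password: <password>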

PostgreSQL Database Depot
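
A sketch under the same assumptions, this time pointing at the PostgreSQL source; again, names and connection values are placeholders.

    version: v1
    name: postgresdb                    # illustrative depot name
    type: depot
    layer: user
    depot:
      type: POSTGRESQL
      external: true
      spec:
        host: <postgres-host>
        port: 5432
        database: customersdb           # placeholder database name
      connectionSecret:
        - acl: rw
          type: key-value-properties
          data:
            username: <username>
            password: <password>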

Use the apply command in the CLI to create the Depot resources.
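
Assuming the DataOS CLI binary is dataos-ctl (the exact subcommand shape may differ across versions), applying the two files looks roughly like:

    dataos-ctl apply -f azure-mssql-depot.yaml
    dataos-ctl apply -f postgres-depot.yaml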

Creating the Query Cluster
To enhance the performance of your analytics workloads, choose appropriately sized clusters, or create a new dedicated cluster for your requirements. In this example, a new cluster is created.
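
A sketch of what a dedicated Minerva cluster spec could look like; the replica count, resource figures, compute profile, and depot addresses are illustrative rather than prescriptive.

    version: v1
    name: minervacluster                # illustrative cluster name
    type: cluster
    cluster:
      compute: query-default            # placeholder compute profile
      minerva:
        replicas: 2
        resources:
          requests:
            cpu: 2000m
            memory: 4Gi
        depots:                         # depots this cluster should be able to query
          - address: dataos://azuremssql
          - address: dataos://postgresdb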

Use the apply command to create the Cluster
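
As with the depots, a single apply call is enough (command shape assumed as before):

    dataos-ctl apply -f minerva-cluster.yaml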

Now, we can run the below query across the two data sources as if we were querying a homogenous plane hosting all tables from both sources. The results are shown in Workbench, a native web-based analytics tool in DataOS.
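
The table and column names below are hypothetical; the point is that tables from both depots appear side by side in a single SQL statement, addressed as catalog.schema.table.

    -- Highest-quantity buyer per product, with the discount they received.
    -- orders lives in the Azure MS-SQL depot, customers in the PostgreSQL depot;
    -- all names are hypothetical placeholders.
    SELECT o.product_id,
           c.customer_name,
           o.order_qty,
           o.discount
    FROM   azuremssql.sales.orders      AS o
    JOIN   postgresdb.public.customers  AS c
           ON o.customer_id = c.customer_id
    WHERE  o.order_qty = (
             SELECT MAX(o2.order_qty)
             FROM   azuremssql.sales.orders AS o2
             WHERE  o2.product_id = o.product_id
           );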

Minerva (query engine) can perform query pushdowns for entire or parts of queries. In other words, it can push processing closer to the data sources for improved query performance.

Unified Governance and Management

‘Policy’ is an independent resource in a unified data architecture that defines rules of relationship between subjects (entities that wish to perform actions on a target), predicates (the action), and objects (a target). Its ABAC (attribute-based access control) implementation uses tags and conditions to define such rules, enabling coarse-grained role-based access to datasets and fine-grained access to individual items within the dataset.

Tags, which can be defined for datasets and users, are used as attributes. As soon as a new source or asset is created, policies pertaining to its tags are automatically applied to them. Changing the scope of the policy is as simple as adding a new attribute or removing an existing one from the policy. Special attributes can be written into the policy and assigned to a specific user if the need for exceptions arises. The attribute-based definition makes it possible to create fewer policies to manage the span of data within an organisation, while also needing minimal changes during policy updates.
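
To make the subject–predicate–object structure concrete, here is a rough sketch of an access policy; the field names and tag values are illustrative, not the authoritative Policy schema.

    version: v1
    name: analyst-read-access           # illustrative policy name
    type: policy
    layer: user
    policy:
      access:
        subjects:
          tags:
            - roles:id:data-analyst     # who: anyone carrying this user tag
        predicates:
          - read                        # what action
        objects:
          tags:
            - dataos:type:dataset       # on which targets
        allow: true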

To enforce policies across all data sources at access time, a Gateway Service sits as an abstraction above the Minerva (query engine) clusters. Whenever a query is fired from any source (query tool, app, etc.), Gateway parses and formats it before forwarding it to the Minerva cluster. After receiving the query, Minerva analyzes it and sends a decision request to Gateway for the governance to be applied. Gateway responds with a decision based on user tags (received from the governance engine) and the data-policy definition (which it stores in its own database – Gateway DB).

Based on the decision, Minerva applies appropriate governance policy changes like filtering and/or masking (depending on the dataset) and sends the final result set to Gateway. The final output is then passed to the source where the query was initially requested.

Ease of Operability and Data Management

A unified storage architecture that provides advanced capabilities needs equally advanced maintenance. Thus, built-in capabilities to declaratively manage and maintain data and metadata files are a necessity.

In Icebase or the lakehouse construct, these operations can be performed using a “Workflow” – a declarative stack for defining and executing DAGs. The maintenance service in Workflow is offered through “actions”, each defined in its own section within a YAML file. Below are examples of a few actions for seamless data management.

Rewrite Dataset
Having too many data files leads to a large amount of metadata in manifest files, while small data files result in less efficient queries and higher file open costs. This issue can be resolved through compaction. Utilizing the rewrite_dataset action, the data files can be compacted in parallel within Icebase. This process combines small files into larger ones, reducing metadata overhead and file open costs during runtime. Below is the definition for the rewrite_dataset action.
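
The workflow scaffolding around the action varies by setup, so resource names, the dataset UDL, and the target file size below are illustrative placeholders; the essential part is the action block.

    version: v1
    name: rewrite-city-dataset          # illustrative workflow name
    type: workflow
    workflow:
      dag:
        - name: rewrite
          spec:
            stack: toolbox              # maintenance stack assumed here
            compute: runnable-default
            toolbox:
              dataset: dataos://icebase:retail/city      # dataset to compact
              action:
                name: rewrite_dataset
                input:
                  options:
                    target-file-size-bytes: 536870912    # combine small files into ~512 MB files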

Expire Snapshots
Writing to an Iceberg table in an Icebase depot creates a new snapshot of the table, which can be used for time-travel queries. The table can also be rolled back to any valid snapshot. Snapshots accumulate until the expire_snapshots action expires them. It is recommended to regularly expire snapshots to delete unnecessary data files and keep the size of the table metadata small.
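
A sketch of the corresponding action block; it sits inside the same workflow shape as the rewrite_dataset example above, and the field name and cutoff timestamp are illustrative.

    action:
      name: expire_snapshots
      input:
        expire_older_than: "2023-06-30T00:00:00.000Z"    # drop snapshots older than this cutoff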

Garbage Management
While executing Workflows upon Icebase depots, job failures can leave files that are not referenced by table metadata, and in some cases, normal snapshot expiration may not be able to determine if a file is no longer needed and delete it. To clean up these ‘orphan’ files under a table location older than a specified timestamp, we can use the remove_orphans action.
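
Sketched under the same assumptions; the older-than cutoff is a placeholder that guards against deleting files belonging to in-flight writes.

    action:
      name: remove_orphans
      input:
        options:
          older_than: "2023-06-30T00:00:00.000Z"         # only remove unreferenced files older than this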

Delete from Dataset
The delete_from_dataset action removes data from tables. The action accepts a filter provided in the deleteWhere property to match rows to delete. If the delete filter matches entire partitions of the table, Iceberg format within the Icebase depot will perform a metadata-only delete. If the filter matches individual rows of a table, then only the affected data files will be rewritten. The syntax of the delete_from_dataset action is provided below:
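
Only the action block is sketched, with an illustrative filter value; the deleteWhere property is the one described above, and the surrounding workflow shape matches the rewrite_dataset example.

    action:
      name: delete_from_dataset
      input:
        deleteWhere: "order_date < '2020-01-01'"         # rows (or whole partitions) matching this filter are deleted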

Time Travel

Iceberg generates snapshots every time a table is created or modified. To time travel, one needs to be able to list and operate on these snapshots, and the lakehouse should provide simple tooling for this. The Data Toolbox provides metadata-updating functionality through the set_version action, with which one can update the metadata version to the latest or any specific version. This is done in two steps:

  1. First, a YAML workflow containing a Toolbox Job is created. A sample Toolbox Workflow is sketched after this list.
  2. The workflow is then applied with the CLI.
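
A sketch of such a Toolbox Workflow; resource and dataset names are illustrative, and the value field can be the latest or a specific metadata version.

    version: v1
    name: set-metadata-version          # illustrative workflow name
    type: workflow
    workflow:
      dag:
        - name: set-version
          spec:
            stack: toolbox
            compute: runnable-default
            toolbox:
              dataset: dataos://icebase:retail/city      # dataset whose metadata pointer is updated
              action:
                name: set_version
                value: latest           # or a specific metadata version

It is then applied with the CLI (command shape assumed as earlier):

    dataos-ctl apply -f set-metadata-version.yaml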

Using the CLI
The CLI on top of a unified data architecture can be used to work directly with Snapshots and metadata. For example, a list of snapshots can be obtained with the following command –
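
The exact subcommand differs across CLI versions, so treat the following as a hypothetical command shape rather than canonical syntax:

    dataos-ctl dataset snapshots -a dataos://icebase:retail/city    # hypothetical subcommand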

To travel back, we can set a snapshot at the desired timestamp.
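
Again a hypothetical command shape; the snapshot id would come from the listing above.

    dataos-ctl dataset set-snapshot -a dataos://icebase:retail/city \
      -i <snapshot-id>                                              # hypothetical subcommand and flag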

One can also list all metadata files, and metadata can also be set to the latest or some specific version using single-line commands.

Partitioning and Partition Evolution

Partitioning is a way to make queries faster by grouping similar rows together when writing. We can use both the Workflow stack through a YAML file and the CLI to work with Partitions in Icebase. The following types of Partitions are supported: timestamp (year, month, day and hour) and identity (string and integer values).

You can partition by timestamp, identity, or even create nested partitions. Below is an example of partitioning by identity.

If the partition field is of identity type, the ‘name’ property is not needed.

Workflow:
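
A sketch of how an identity partition might be declared in a workflow output writing to an Icebase depot; only the partition-related fragment is shown, and the dataset UDL and column name are illustrative.

    # Fragment of a workflow output; surrounding job fields are omitted.
    outputs:
      - name: cities
        dataset: dataos://icebase:retail/city?acl=rw
        format: iceberg
        options:
          iceberg:
            partitionSpec:
              - type: identity
                column: state_code      # identity partitions need no 'name' property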

CLI:
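
A hypothetical CLI equivalent; the subcommand and flags are assumptions, not authoritative syntax.

    dataos-ctl dataset add-partition -a dataos://icebase:retail/city \
      -p "identity:state_code"                                      # hypothetical subcommand and flag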

Thanks to Iceberg, when a partition spec is evolved, the old data written with an earlier partition key remains unchanged, and its metadata remains unaffected. New data is written using the new partition key in a new layout. Metadata for each of the partition versions is kept separately. When you query, each partition layout’s respective metadata is used to identify the files it needs to access; this is called split planning.

Say we are working on a table, “NY Taxi Data”. The NY Taxi data has been ingested and is partitioned by year. When new data is appended, the table is updated to partition the data by day. Both partitioning layouts can co-exist in the same table, and queries need not be modified to account for the different layouts.

Summary

To summarize: siloed storage is a deal-breaker for data management practices. In this article, we saw how data stored, accessed, and processed in silos creates a huge cognitive overload for data developers and eats up massive resources. In contrast, a unified storage paradigm bridges the gaps between the different capabilities essential to the data stack and helps users produce and consume data as if it were one big connective tissue. Users can access data through standardized access points, analyze it optimally without fretting about specific engines and drivers, and, most essentially, rely on data that arrives on their plate with governance and quality essentials embedded.
