In the past, the world of business intelligence was relatively uncomplicated. Organizations primarily focused on extracting structured, well-organized data from their enterprise systems and conducted analytics, reporting, and querying on that data. Modern organizations, however, are grappling with a completely transformed landscape. They now deal with an array of new data types from a variety of sources, including smart sensors, IoT devices, customer mobile phones, and more. Furthermore, thanks to APIs and other technological advancements, organizations increasingly incorporate third-party data into their analytics, querying, and reporting processes. This type of data often requires substantial preparation to make it valuable and reliable.
Moreover, the advent of new technologies has expanded and streamlined these processes. Many organizations now employ advanced techniques such as predictive analytics, machine learning, and artificial intelligence to delve deeper into their data and gain more precise insights. These workloads not only demand a columnar store database, which means reorganizing the data as it is prepared, but they often also demand reorganizing the data in multiple ways into “data marts” that address different types of analysis.
The demands of varying data sources and varying workloads require organizations to reassess their approach to ETL (Extract, Transform, Load) versus ELT (Extract, Load, Transform).
ETL vs. ELT Basics
ETL and ELT both encompass processes to bring together data from diverse sources for analysis, but the key distinction lies in the sequence of these steps. Choosing between them depends on the specific use case and the role of resources like cloud data warehouses.
In an ETL (Extract, Transform, Load) operation, data is initially extracted in its raw form from diverse sources, including databases and files. Subsequently, this data undergoes transformation to prepare it for storage and analysis. During this phase, the data might undergo reformatting, filtering, enrichment, aggregation, and more. After transformation, the data is loaded into a system that supports querying and analytics, such as a database or data warehouse.
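As a rough illustration, here is a minimal ETL sketch in Python. The file name, columns, and target table are hypothetical, and SQLite stands in for the analytics database so the example is self-contained; the key point is that the transformation happens outside the target system, before loading.

```python
import csv
import sqlite3

# Extract: read raw order records from a hypothetical CSV export.
with open("orders_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: filter, cast, and reformat *before* anything is loaded.
transformed = [
    (row["order_id"], row["customer_id"], float(row["amount"]), row["order_date"][:10])
    for row in raw_rows
    if row["status"] == "COMPLETED"  # drop incomplete orders
]

# Load: write the prepared rows into the analytics store
# (SQLite stands in for the warehouse so the sketch runs end to end).
conn = sqlite3.connect("analytics.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders "
    "(order_id TEXT, customer_id TEXT, amount REAL, order_date TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", transformed)
conn.commit()
conn.close()
```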
With ELT (Extract, Load, Transform), the extraction phase is akin to ETL. However, raw data is loaded directly into a data lake or data warehouse. Transformation is subsequently performed within that system, leveraging the storage format as well as the parallel processing power of its cluster of servers. Analysis, reporting, and querying are then conducted within the same system.
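The same hypothetical pipeline restructured as ELT looks quite different: the raw rows are landed as-is, and the filtering, casting, and reshaping are expressed in SQL that the target system runs with its own compute. SQLite again stands in for a cloud data warehouse purely to keep the sketch runnable.

```python
import csv
import sqlite3

conn = sqlite3.connect("analytics.db")

# Extract and Load: land the raw data untouched in a staging table.
with open("orders_export.csv", newline="") as f:
    raw_rows = [
        (r["order_id"], r["customer_id"], r["amount"], r["order_date"], r["status"])
        for r in csv.DictReader(f)
    ]
conn.execute(
    "CREATE TABLE IF NOT EXISTS raw_orders "
    "(order_id TEXT, customer_id TEXT, amount TEXT, order_date TEXT, status TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)", raw_rows)

# Transform: the target system filters, casts, and reshapes in SQL,
# using its own (in a real warehouse, massively parallel) compute.
conn.executescript("""
    CREATE TABLE IF NOT EXISTS orders_clean AS
    SELECT order_id,
           customer_id,
           CAST(amount AS REAL) AS amount,
           substr(order_date, 1, 10) AS order_date
    FROM raw_orders
    WHERE status = 'COMPLETED';
""")
conn.commit()
conn.close()
```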
ETL, in contrast to ELT, requires the addition of intermediate processing power during the transformation phase, which can be resource-intensive depending on the data volume. With the explosion of data size, many organizations lack the necessary computational infrastructure. ELT shifts the processing burden to the infrastructure of the target system, which is often cloud-based. This leverages on-demand compute scalability and distributed processing capabilities inherent in cloud environments.
What’s Changed? What’s New? And the Impact on ETL vs. ELT
Today’s organizations confront a plethora of data sources, data types, and an array of cloud platforms designed to store, analyze, and query data. These changes mandate a reevaluation of the choice between ELT and ETL. In the past, the decision was more straightforward, influenced by data modeling principles popularized by Ralph Kimball, particularly dimensional modeling.
Kimball introduced the data warehousing industry to dimensional modeling in 1996, emphasizing denormalization and star schema methodologies. Back then, analytical systems predominantly relied on transactional and relational databases that favored row-oriented storage. However, the landscape has evolved considerably.
Today, row storage is often insufficient for analytics, compelling businesses to transform data from row store to column store. Because these are bulk operations, that transformation is best performed in a dedicated data warehouse. But batch loads from a transactional row store into Kimball’s star schema are only half the story. Many new data sources bypass the transactional row store altogether and feed directly into a columnar store.
Time series data from IoT devices, smart sensors, and event logs is often streamed directly into the columnar warehouse using message brokers such as Apache Kafka. This data is typically highly denormalized, with both fact and dimensional data in a single “wide table,” and it may include complex data types such as arrays and maps. Like data in a Kimball schema, wide-table data is often transformed or enriched. It is frequently processed directly in this wide-table format, particularly by machine learning models that consume complete rows and return classifications, predictions, or recommendations.
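A simplified sketch of that streaming pattern, using the open-source kafka-python client: the topic, broker address, and event fields are hypothetical, and the final hand-off to the warehouse’s bulk-load path is left as a placeholder.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a hypothetical topic of IoT sensor events.
consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

batch = []
for message in consumer:
    event = message.value
    # Denormalize into a single "wide" row: measures (facts) and
    # descriptive attributes (dimensions) side by side.
    wide_row = (
        event["timestamp"],
        event["device_id"],
        event["site"],
        event["temperature"],
        event["humidity"],
        json.dumps(event.get("tags", {})),  # complex map-like data kept intact
    )
    batch.append(wide_row)

    if len(batch) >= 1000:
        # In practice, hand the micro-batch to the columnar warehouse's
        # bulk-load or streaming-ingest interface; shown as a placeholder here.
        print(f"loading {len(batch)} rows into the wide table")
        batch.clear()
```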
How OpenText Vertica Facilitates This
Different organizations and groups have diverse data requirements. Some may need comprehensive solutions for data management, data transformation, and analytics, while others, such as business units, may require the capability to work directly with data in databases or data warehouses. OpenText Vertica excels at both, offering technology for efficiently managing extensive data transformations and analysis.
Vertica grew out of the innovative work of Michael Stonebraker, creator of Ingres, one of the earliest relational databases, and later of the popular Postgres database. Stonebraker recognized the significance of columnar storage for enhancing data retrieval speed. His C-Store paper, which advocated for read-optimized relational database management systems, laid the groundwork for the Vertica solution.
The Vertica Analytics Platform is a columnar, relational, ANSI-Standard SQL, and ACID-compliant database designed for demanding data analytics use cases. It supports in-database analytics with a wide array of built-in functions, including time series, event pattern matching, geospatial analysis, and in-database machine learning. Vertica is not simply a data warehouse but a data lakehouse, also offering the ability to query data in-place, such as in a data lake.
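As a rough sketch of the query-in-place idea, an external table can be defined over files that stay in the data lake and then queried with ordinary SQL. The connection details, bucket path, and columns below are hypothetical, and the exact syntax may vary by Vertica version; this uses the open-source vertica-python client.

```python
import vertica_python  # pip install vertica-python

# Hypothetical connection details.
conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Define an external table over Parquet files that remain in the lake.
    cur.execute("""
        CREATE EXTERNAL TABLE lake_events (
            event_time TIMESTAMP,
            device_id  VARCHAR(64),
            reading    FLOAT
        ) AS COPY FROM 's3://example-bucket/events/*.parquet' PARQUET
    """)
    # Query the lake data in place, alongside warehouse tables.
    cur.execute("SELECT device_id, AVG(reading) FROM lake_events GROUP BY device_id")
    for device_id, avg_reading in cur.fetchall():
        print(device_id, avg_reading)
```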
One of Vertica’s standout features is its approach to “data marts.” A data mart is a purpose-built copy of warehouse data intended to address a particular type of workload. A warehouse may store data in date sequence, which is great for date-oriented queries, but date-sequenced data is inefficient for account-oriented queries. In that case, a DBA would typically create a data mart with the same data resequenced by account, making account-oriented queries efficient. Building such a data mart requires creating and maintaining a second set of transformations (a mini ETL, in effect). Vertica’s “projections” avoid this additional work and maintenance.
Projections are partial or complete copies of the data set, sorted to expedite queries with particular access patterns, such as date-ordered versus account-ordered retrieval. What’s more, projections are transparent to the user; Vertica automatically selects the best projection for each query. Projections enhance query performance by eliminating additional sorts, and Vertica’s columnar storage and compression techniques further reduce network traffic and disk I/O.
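A sketch of what that looks like in practice, assuming a hypothetical sales table whose default projection is ordered by date: adding a second projection sorted by account serves account-oriented queries without a separate data mart or a second ETL pipeline (simplified syntax; segmentation and k-safety details depend on the cluster).

```python
import vertica_python  # pip install vertica-python

# Hypothetical connection details, as in the earlier sketch.
conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Add a projection of the sales data sorted by account, so
    # account-oriented queries avoid a resort.
    cur.execute("""
        CREATE PROJECTION sales_by_account AS
        SELECT account_id, sale_date, amount
        FROM sales
        ORDER BY account_id
        SEGMENTED BY HASH(account_id) ALL NODES
    """)
    # Populate the new projection from existing data.
    cur.execute("SELECT START_REFRESH()")
    # Queries still target the base table; Vertica picks the best projection.
    cur.execute("SELECT account_id, SUM(amount) FROM sales GROUP BY account_id")
```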
Combined, these features enable Vertica to do more with less.
In other words, Vertica is very well suited for the varied sources of data and increased processing demands of enterprise analytics and machine learning workloads. What’s more, Vertica includes a host of machine learning algorithms to automate data-driven decision-making. Most models are written in C++ and designed to take advantage of parallel execution across the cluster – in other words, they run at blazing speeds.
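For example, a model can be trained and applied entirely inside the database using Vertica’s in-database machine learning functions. The table and columns below are hypothetical, and the calls follow Vertica’s documented linear regression API; treat this as a sketch rather than a complete workflow.

```python
import vertica_python  # pip install vertica-python

# Hypothetical connection details, as in the earlier sketches.
conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Train a linear regression model in-database on a hypothetical
    # wide table of sensor readings; the data never leaves the cluster.
    cur.execute("""
        SELECT LINEAR_REG('energy_model', 'sensor_readings',
                          'energy_used', 'temperature, humidity')
    """)
    # Score complete rows in parallel across the cluster with plain SQL.
    cur.execute("""
        SELECT device_id,
               PREDICT_LINEAR_REG(temperature, humidity
                                  USING PARAMETERS model_name='energy_model') AS predicted
        FROM sensor_readings
    """)
    predictions = cur.fetchall()
```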
In summary, the changing data landscape has shifted the ETL vs. ELT conversation. Modern data requirements call for efficient columnar databases and cloud-based analytics platforms. OpenText Vertica, with its innovative features and capabilities, is ideally suited to address the complex needs of today’s enterprise organization.