Apache Iceberg Quickly Becoming Large-Scale Analytics Format Of Choice

Apache Iceberg enables multiple applications to process the same dataset and to understand the metadata inside each table.

Written By

David Curry

Jan 10, 2023

2 minute read

Apache Iceberg, which was born out of a love/hate relationship that many Netflix engineers had with data warehouse software Apache Hive, has in the space of five years become the go-to choice for developers working on large-scale analytics tables.

In 2015, Netflix engineers Ryan Blue and Daniel Weeks began work on Iceberg as a solution to many of the issues developers had with Apache Hive, which was heavily integrated into Netflix infrastructure. The problems had become so commonplace that engineers routinely avoided Hive services and entered data manually instead, which hurt productivity.

With Iceberg, Netflix aimed to ensure the correctness and validity of data transactions, regardless of errors, power failures, and other issues that can occur at the processing stage. It also wanted to improve table performance through finer-grained operations, allowing analysis to be done at the file level, and to simplify the operation and maintenance of tables.
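To make those two goals concrete, here is a minimal toy sketch in Python. It is not Iceberg's API or implementation. The class and field names (`ToyTable`, `Snapshot`, `DataFile`, `min_day`, `max_day`) are hypothetical, invented for illustration. It assumes a table is an immutable snapshot listing its data files, that a write succeeds only if it swaps in a whole new snapshot built on the current one, and that each file carries simple min/max statistics a reader can use to skip files.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical file-level metadata: a path plus min/max stats for one column."""
    path: str
    min_day: int
    max_day: int


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the complete list of files at one point in time."""
    snapshot_id: int
    files: Tuple[DataFile, ...]


class ToyTable:
    """Toy model only; real table formats persist this metadata in storage."""

    def __init__(self) -> None:
        self.current = Snapshot(0, ())

    def commit_append(self, base: Snapshot, new_files: Iterable[DataFile]) -> bool:
        # All-or-nothing: the write is visible only if the table has not
        # changed since the writer read `base`. A stale writer is rejected
        # and readers never observe a half-applied change.
        if self.current is not base:
            return False
        self.current = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        return True

    def plan_scan(self, day: int) -> List[str]:
        # File-level pruning: keep only files whose stats could contain `day`.
        return [f.path for f in self.current.files if f.min_day <= day <= f.max_day]


table = ToyTable()
base = table.current
ok = table.commit_append(
    base, [DataFile("a.parquet", 1, 10), DataFile("b.parquet", 11, 20)]
)
# A second writer still holding the old snapshot is rejected and must retry.
stale = table.commit_append(base, [DataFile("c.parquet", 21, 30)])
print(ok, stale, table.plan_scan(15))
```

In this sketch the second writer would reread the current snapshot and try again, which is one common way to keep concurrent writers from corrupting a shared table.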

The team succeeded, and Netflix shifted much of its operations to Iceberg. A year after publishing Iceberg, the team donated the project to the Apache Software Foundation. The project's creators went on to launch their own data automation platform for data warehouse storage, called Tabular.

In the years since it was donated to Apache, Iceberg has been adopted by a long list of major tech companies, including Adobe, Airbnb, Apple, Google, LinkedIn, Snapchat, and Snowflake. Many of them are prioritizing Iceberg over other formats. Google, for one, has heard from many of its Cloud customers that Iceberg should take priority over Databricks' Delta Lake and Apache Hudi, two alternative big data analytics formats.

Google has retained the availability of all three formats for the time being, although Sudhir Hasbe, a senior director of product management at Google Cloud, confirmed to The Register that Apache Iceberg was becoming the “primary format.” Cloudera and Snowflake have also announced support for Iceberg in the past two years, with signs that they may move away from other formats in the future.

There are plenty of benefits to Apache Iceberg. Many developers of analytics applications value the vendor- and platform-agnostic approach that comes with Apache Software Foundation governance, in contrast to Databricks' Delta Lake and other formats, which are not as widely supported. As for the format itself, Apache Iceberg enables multiple applications to process the same dataset and to understand the metadata inside each table, and updates to massive data lake tables are processed much faster than with other formats. Apache Iceberg also improves data management and reliability, with better identification and resolution of issues inside the tables.