Apache Hudi: How Uber Gets Data a Ride to its Destination

Written By
Joe McKendrick
Apr 20, 2022

At a busy, data-intensive enterprise such as Uber, the volume of real-time data that needs to move through its systems minute by minute reaches epic proportions. This calls for a data lake extraordinaire, in which data can be extracted immediately and put to work across a range of functions, from back-end business applications to front-end mobile apps. Uber depends on up-to-the-minute bookings and alerts as part of its appeal to customers, so its reliance on real-time data streaming platforms is off the charts. It has turned to Apache Hudi, an emerging platform that brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.

I recently had the opportunity to moderate a webcast about Apache Hudi with Nishith Agarwal and Sivabalan Narayanan, both engineers with Uber. Agarwal and Narayanan are also active members of the Hudi project management committee.

The Hudi data lake project was originally developed at Uber in 2016, open-sourced in 2017, and submitted to the Apache Incubator in January 2019. Apache Hudi data lake technology enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The solution provides tools to ingest data onto HDFS or cloud storage, and it offers an incremental alternative to resource-intensive ETL, Hive, or Spark jobs. It is designed to get data into the hands of users and analysts much more quickly.

At Uber, “Hudi powers many different use cases,” says Agarwal, noting that the company’s enterprise data lake is built on Hudi. “We have about 250 petabytes of data that’s managed by the data lake platform. The kinds of use cases that it enables are, for example, whenever you build machine learning pipelines. One of the challenges is, if data is changing upstream, and I want to update my feature set, how do I update my feature set without actually reading the entire data and re-snapshotting it? That becomes a really costly process. For example, if we run the data models for UberEats, which are massive, hundreds and hundreds of terabytes, consuming that data becomes tricky. One of the ways where Hudi is being employed is to make all of this incremental, with all of these primitives.”
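The incremental idea Agarwal describes can be illustrated with a small toy model. The sketch below is not Hudi's API and not Uber's pipeline; the class `ToyIncrementalTable`, the `refresh_features` helper, the record keys, and the `is_large_order` feature are all hypothetical names chosen for illustration. It only shows the general principle: if every write is stamped with a commit time, a downstream job can ask for the records changed since its last checkpoint instead of re-reading the whole table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIncrementalTable:
    """Toy in-memory table keyed by record key. Illustrative only, not Hudi."""
    rows: dict = field(default_factory=dict)  # record_key -> (commit_time, payload)
    latest_commit: int = 0

    def upsert(self, batch: dict) -> int:
        """Apply a batch of inserts and updates as one commit; return its commit time."""
        self.latest_commit += 1
        for key, payload in batch.items():
            self.rows[key] = (self.latest_commit, payload)
        return self.latest_commit

    def snapshot(self) -> dict:
        """Full read: every record in the table (the costly path)."""
        return {key: payload for key, (_, payload) in self.rows.items()}

    def changed_since(self, commit_time: int) -> dict:
        """Incremental read: only records written after the given commit time."""
        return {
            key: payload
            for key, (commit, payload) in self.rows.items()
            if commit > commit_time
        }


def refresh_features(features: dict, table: ToyIncrementalTable, checkpoint: int):
    """Recompute a hypothetical feature only for records changed since the checkpoint."""
    changes = table.changed_since(checkpoint)
    for key, payload in changes.items():
        features[key] = {"is_large_order": payload["amount"] > 50}
    # Return how many records were processed and the new checkpoint to resume from.
    return len(changes), table.latest_commit


table = ToyIncrementalTable()
table.upsert({
    "order-1": {"amount": 20},
    "order-2": {"amount": 75},
    "order-3": {"amount": 40},
})

features = {}
first_run, checkpoint = refresh_features(features, table, checkpoint=0)

# Upstream data changes: a single order is corrected.
table.upsert({"order-2": {"amount": 30}})
second_run, checkpoint = refresh_features(features, table, checkpoint)

print(first_run, second_run)  # records processed in each run
```

In this toy run the second refresh touches only the one corrected record, while a snapshot-based job would have re-read all three. At the scale quoted above, that difference is what makes re-snapshotting costly.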

Another use case is around managing earnings data, Agarwal continues. “As we go through all of the business use cases that Uber has, exposing different data to different customers, to different users, how do we do that in an efficient way? How do you point out exactly where the data lies and then be able to expose this data again to the record level, indexing all of these things? Hudi helps immensely in those kinds of use cases.”
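To make the point about record-level indexing concrete, here is a minimal sketch under stated assumptions. It is a toy, not Hudi's index implementation: the `ToyRecordIndex` class, the round-robin file placement, and the `driver-N` keys and `earnings` field are hypothetical. The idea it illustrates is that a mapping from record key to file lets a writer go straight to the file that holds a record, rather than opening files one by one to find it.

```python
class ToyRecordIndex:
    """Toy store of several 'files' plus a record-key index. Illustrative only."""

    def __init__(self, num_files: int):
        self.files = [{} for _ in range(num_files)]  # each dict stands in for a data file
        self.key_to_file = {}                        # record_key -> file number

    def upsert(self, key, value) -> int:
        """Write via the index; return the number of files touched (always 1)."""
        if key not in self.key_to_file:
            # New key: assumed round-robin placement, purely for the example.
            self.key_to_file[key] = len(self.key_to_file) % len(self.files)
        self.files[self.key_to_file[key]][key] = value
        return 1

    def find_by_scan(self, key) -> int:
        """Locate a key without the index; return how many files had to be opened."""
        for opened, data_file in enumerate(self.files, start=1):
            if key in data_file:
                return opened
        raise KeyError(key)


store = ToyRecordIndex(num_files=4)
for i in range(8):
    store.upsert(f"driver-{i}", {"earnings": 100 * i})

# Without the index, finding driver-7 means opening files until it turns up.
scanned = store.find_by_scan("driver-7")

# With the index, the update goes directly to the one file that holds the record.
touched = store.upsert("driver-7", {"earnings": 999})

print(scanned, touched)  # files opened by scan vs. files touched via index
```

With four toy files the gap is small, but the scan cost grows with the number of files while the indexed lookup does not, which is why locating data at the record level matters on a very large lake.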

Going forward, Agarwal anticipates tighter integration with other streaming platforms such as Kafka. “Generally, Hudi will connect to Kafka directly and pull streams. Kafka Streams itself is also an execution framework, like Apache Flink, but has some custom semantics, and right now, there is no support for running Hudi on Kafka Streams, but we are looking at providing connectors that may be able to do that.”


Joe McKendrick is RTInsights Industry Editor. He is a regular contributor to Forbes on digital, cloud and Big Data topics. He served on the organizing committee for the recent IEEE International Conference on Edge Computing. Follow him on Twitter @joemckendrick.