Best Practices for Lakehouse Operations on Amazon S3

Steps include selecting a storage strategy, migrating data to the cloud, and optimizing performance.

The process of building a data lakehouse can be daunting, especially for businesses making their first transition to the cloud. The term data lakehouse was coined by cloud data provider Databricks as a portmanteau of data warehouse and data lake, combining some of the best features and capabilities of each.

In a session on operating a lakehouse on Amazon S3 at the Subsurface Data Lakehouse Conference, AWS analytics specialist Jorge Lopez and senior product manager Huey Han shared best practices for building and managing a lakehouse.

When preparing a data lakehouse strategy, the first decision to make is where to store the data. There are multiple dimensions to this decision, such as storage and scalability needs, data governance and encryption services, and ecosystem integrations. Most cloud service providers can offer the necessary scale, but many differ on first-party security and access controls.
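As a concrete illustration of that storage decision on AWS, the minimal sketch below uses boto3 (the AWS SDK for Python) to provision an S3 bucket with default server-side encryption and public access blocked. The bucket and region names are placeholders, and this exact setup was not prescribed in the session.

```python
"""Minimal sketch: provisioning an S3 bucket as a lakehouse storage layer
with default encryption and public access blocked. Bucket and region names
are hypothetical placeholders."""
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket that will hold the lakehouse data.
s3.create_bucket(Bucket="example-lakehouse-raw")

# Enforce server-side encryption (SSE-S3) on every object by default.
s3.put_bucket_encryption(
    Bucket="example-lakehouse-raw",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)

# Block all public access at the bucket level.
s3.put_public_access_block(
    Bucket="example-lakehouse-raw",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```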

Once the storage layer is decided, organizations need to prepare the migration from on-premises infrastructure or another cloud provider to the new cloud environment. AWS offers the Snow Family for offline, on-premises migration, and AWS DataSync and the Transfer Family for online transfers. For organizations operating across more than one environment, such as multi-cloud or hybrid cloud, AWS also offers Storage Gateway, which can connect to the other storage environment.
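For teams taking the online route, the rough sketch below shows what a DataSync transfer from an on-premises NFS share into S3 might look like with boto3. The hostname, ARNs, and role names are hypothetical, and the session did not walk through this specific configuration.

```python
"""Rough sketch of an online migration with AWS DataSync via boto3: an
on-premises NFS share is copied into an S3 bucket. All ARNs, hostnames,
and role names are hypothetical placeholders."""
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Source: an on-premises NFS export, reached through a deployed DataSync agent.
source = datasync.create_location_nfs(
    ServerHostname="nfs.on-prem.example.com",
    Subdirectory="/exports/warehouse",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:123456789012:agent/agent-example"]},
)

# Destination: the S3 bucket that backs the lakehouse storage layer.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-lakehouse-raw",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::123456789012:role/DataSyncS3Role"},
)

# Define and kick off the transfer task.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="warehouse-to-s3-migration",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```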

See also: What is a Data Lakehouse?

After migrating to a cloud service provider, the process of “putting the data to work” begins. Cloud service providers offer many first-party services for collecting, processing, and utilizing data. With AWS, there are services for big data processing, interactive query, real-time analytics, and business intelligence, alongside thousands of partner services.
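One illustration of the interactive-query piece is running SQL directly over data already landed in S3 with Amazon Athena. The sketch below assumes a hypothetical database, table, and results bucket rather than anything shown in the session.

```python
"""Illustrative sketch of an interactive query against data in S3, using
Amazon Athena through boto3. Database, table, and bucket names are
hypothetical."""
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a SQL query over the lakehouse tables; results land in S3.
response = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "lakehouse_db"},
    ResultConfiguration={"OutputLocation": "s3://example-lakehouse-results/athena/"},
)
print("Query execution id:", response["QueryExecutionId"])
```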

Implementing batch or real-time data streams can help organizations further improve the quality of the data they capture and reduce the number of data repositories. On AWS, this is done through Amazon Kinesis, which has Apache Flink and SQL integrations for data stream analysis, as well as open-source frameworks such as Kafka and Spark Streaming. AWS Glue can extract, transform, and load (ETL) data from multiple sources into a consistent stream.
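On the streaming side, a producer feeding Amazon Kinesis Data Streams can be as small as the boto3 sketch below; the stream name and record layout are invented for illustration.

```python
"""Minimal sketch of feeding a real-time stream with Amazon Kinesis Data
Streams via boto3; the stream name and record layout are hypothetical."""
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Push one clickstream event; PartitionKey controls shard distribution.
event = {"user_id": "u-123", "action": "page_view", "page": "/pricing"}
kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```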

Once everything has been set up, a few key performance guidelines can reduce the overall cost of the cloud service while potentially improving performance. These include a constant focus on measuring and iterating, horizontal scaling, optimizing file sizes, using compression and columnar data formats, and, as Lopez recommended, leveraging Amazon S3 Select and the AWS SDKs.
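The S3 Select recommendation, for example, amounts to filtering server-side so that only matching rows cross the network instead of downloading whole objects. The sketch below assumes a hypothetical compressed CSV object and column names.

```python
"""A sketch of the S3 Select guideline: only the matching rows of a
compressed CSV are pulled back, rather than the whole object. Bucket, key,
and column names are hypothetical."""
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Filter server-side so only the needed rows cross the network.
resp = s3.select_object_content(
    Bucket="example-lakehouse-raw",
    Key="sales/2023/orders.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM s3object s WHERE s.region = 'EMEA'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; print the returned rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```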

In the session, Han also spoke on the tools a cloud service provider should offer for data governance. Identity and access management (IAM) is one of the most prominent, which can segment clients into distinct groups, provide customers and employees with faster data access while reducing breaches, and tailor permissions while maintaining centralized control. Encryption should be provided at rest and in transit, with flexible key management on both the server and client side. The capability to redact sensitive data with serverless code and to retain data for user-defined periods was also highlighted as valuable for a cloud service to provide.
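On the retention point, one common way to express a user-defined period on S3 is a lifecycle rule. The sketch below expires objects under a hypothetical staging/ prefix after 365 days; it is an illustration, not a configuration prescribed in the session.

```python
"""Hedged sketch of a user-defined retention period expressed as an S3
lifecycle rule: objects under a prefix expire after 365 days. Bucket and
prefix names are hypothetical."""
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-lakehouse-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-staging-after-one-year",
                "Filter": {"Prefix": "staging/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```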
