Developing Secure, Compliant Data Products with Databricks Lakehouse Apps

Databricks’ VP of Product Management, Shanku Niyogi, explains how the new Lakehouse Apps helps developers build data-intensive applications that are fast and meet security and compliance requirements.

The Databricks Data and AI Summit (June 2023) featured the release of new features, products, and cloud environments aimed at reducing friction for application developers, partners, and customers. Partner Connect allows customers to build out their modern data stacks with validated products from Databricks’ ecosystem, while the Marketplace features curated data sets (public or commercial), AI models, and AI notebooks to give customers a head start on their projects. Lakehouse Apps, on the other hand, serves the needs of the developers who are tasked with building applications that feed from or into the Databricks platform.

Shanku Niyogi, VP of Product Management at Databricks, took the time to take Cloud Data Insights behind the scenes of Lakehouse Apps and give us his perspective on what makes it a game changer for application developers. On the surface, the fact that developers no longer have to settle for coding on their laptops or wait to be granted the right cloud instance is compelling enough. But there's more to it, as Shanku explains in this interview.

CDI: Instead of starting off with interview questions, let’s start with you just sharing what you see as the real value of Lakehouse Apps to application developers.


Shanku: Databricks Lakehouse Apps brings data and AI applications to the Databricks platform for the first time, so applications can run directly on Databricks, next to your data. That does two big things that we're excited about. The first is for customers: it dramatically expands the set of use cases where you can get more out of your data and AI. It simplifies access to a whole range of components and applications that can make use of your data and AI. And as customers develop applications, they can run those next to their data securely.

The space of data and AI innovation has dramatically expanded over the last year, as you know, and the advent of LLMs and generative AI is accelerating that even more. Through Lakehouse Apps, we're going to make it easier for all of the developers who are building interesting and innovative solutions to bring those to Databricks customers through the Databricks Marketplace, where they can now reach over 10,000 Databricks customers. For a developer, the Marketplace becomes a very compelling way to distribute your applications instantly, and customers can install, secure, manage, and run them with the same ease, and the same tools, as Databricks itself.

Traditionally, when developers build applications on top of data and AI models, the challenge is always getting those things into the hands of users. There are multiple reasons for that. The biggest is the data itself. Data is often the most guarded thing that customers have, and providing access to applications comes with all sorts of complexity, such as clearing legal and compliance hurdles. And from the application developer's side, there's a lot of complexity to navigate to make sure that what you're building is conformant and compliant. The second reason is that data and AI workloads often require significant infrastructure in terms of CPUs and GPUs. The app developer can't just pick a cloud to run their apps on, because customers are running on many clouds in many regions.

Often, the data actually decides where these applications can run. There’s a pretty high cost there for developers. So, with Lakehouse Apps, customers can run apps directly on their Databricks instance next to their data. That means that those applications can use data, they can use models on that data, and the data never leaves the customer’s instance. That’s a big change. It means that customers can now use those applications and still be assured of the same kind of security and compliance that Databricks provides for the data.

Traditionally, running your application securely in the customer's data plane has meant either compromising on what you're building by rewriting parts of your application in scripts, SQL, or proprietary frameworks, or handing the customer a container or a VM and saying, here, run it yourself. That's, of course, not the ideal way to do software distribution these days.

We are solving these challenges with a containerized runtime on our compute platform, where developers can build their applications in any technology they want. They can use Spark on our clusters to run jobs. They can use our models, and models that have been trained on Databricks data. They can use our jobs API and pipelines API to ingest data.
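To make that concrete, here is a minimal sketch of the pattern Shanku describes: an application calling the Databricks Jobs REST API (2.1) to trigger an existing ingestion job. The host, token handling, and job ID below are illustrative assumptions for this example, not Lakehouse Apps specifics; a real Lakehouse App would receive its credentials from the platform.

```python
# Minimal sketch: trigger an existing Databricks job via the Jobs REST API 2.1.
# DATABRICKS_HOST/DATABRICKS_TOKEN and job_id=123 are placeholder assumptions.
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def run_ingestion_job(job_id: int) -> int:
    """Kick off an existing Databricks job and return its run ID."""
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers=HEADERS,
        json={"job_id": job_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

def run_state(run_id: int) -> str:
    """Poll the run's lifecycle state (PENDING, RUNNING, TERMINATED, ...)."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["state"]["life_cycle_state"]

if __name__ == "__main__":
    run_id = run_ingestion_job(job_id=123)  # 123 is a placeholder job ID
    print(run_id, run_state(run_id))
```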

See also: What is a Data Lakehouse?

CDI: It seems like Lakehouse Apps will streamline the handoff of data from data engineers, even allowing application developers to do their own data ingestion. Now, when developers get a request from the business for some data artifact, they don't have to wait on data engineers and the compliance team. When we think of self-service, we imagine a business user, but Lakehouse Apps gives the application developer a high degree of self-service. What factors would you say are impacting developers who work with data?

Shanku: I think the same way that open source or the cloud powered the last wave of application development, data and AI are powering the current wave. If developers cannot run their applications and have access to data securely, you end up with all these little islands of data that are stuck with the applications. We’re trying to make it easier for any developer in any organization to build their applications directly on the data that’s in their Databricks lakehouse. They also need the tools and services that they use today, which they can just bring to Lakehouse Apps or use what is available there.

CDI: You are clearly paying very close attention to governance and security issues with data. It’s a common complaint that security is often considered too late when building a product or service. Was it considered in the design of Lakehouse Apps?

Shanku: That was very much the primary motivation. Application developers are very capable, and they can build just about anything, but the real challenge in getting those applications into the hands of users is the data they're consuming. I'm sure you've seen recent stories in the news about EU regulations requiring major software companies to rethink where their applications need to run. Our goal was to take all of the work we've done as Databricks to earn customers' trust, to be in every region, and to have secure infrastructure that keeps customers' data secure and compliant, and to provide a way for applications to take advantage of that. So a key design consideration is that applications running in Databricks can use data entirely within the customer's data plane, and the data never leaves. You don't have to worry about who you're giving the data to, whether it's anonymized, etc. Every application will run in a sandbox where the administrator can control exactly what data that application can use. They can also control who can access an application.

Data regulation in the AI world is just going to get more complicated. By moving the applications closer to the data platform, the apps can now take advantage of the controls you configure in the data platform itself. You can govern in one place, and applications automatically pick up and are bound by those requirements.
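As a rough sketch of what "govern in one place" can look like in practice, the following uses standard Unity Catalog SQL from a Databricks notebook (where spark is predefined); the table name and service principal are hypothetical, and this is general Databricks governance rather than anything specific to Lakehouse Apps.

```python
# Hedged illustration of centralized governance: an administrator grants a
# (hypothetical) service principal read access to one table via Unity Catalog.
# Anything bound to that principal inherits exactly this access and no more.
spark.sql("""
    GRANT SELECT
    ON TABLE main.sales.orders
    TO `app-service-principal`
""")

# Auditing is equally centralized: SHOW GRANTS reveals what the principal
# can touch on the governed object.
spark.sql("SHOW GRANTS `app-service-principal` ON TABLE main.sales.orders").show()
```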

Whereas, if applications are running everywhere, it's a nightmare to manage, both for customers and for the people building those applications, who have to navigate the entire data compliance space instead of working on the value they're bringing to customers. We help them keep customer data safe.

CDI: So the lakehouse inherited its reputation as a "vault" from the data warehouse, making it the most protected data repository a company would have. When you talk with customers, do you find that you just have to convince the CDO and CIO of your security capabilities, or do you also have to get others in the organization comfortable with Databricks' level of compliance control and security?

Shanku: Absolutely. We have two sets of conversations around Lakehouse Apps. The first is with partners, startups, and others who are building AI solutions, and they're hearing this from their customers, who have a long list of compliance requirements they need to meet.

Their customers want to know whether they can run an application in their own virtual private cloud (VPC) or in some other place. So, of course, this solution will make it a lot easier for them to go to market. Then there are the customers themselves: we have a lot of customers building in-house applications on top of their data, so they have development teams already doing this. I've heard from folks in our field that people are hand-installing these applications on a VM in the VPC, or doing other things that really don't scale, all to get through that vault and run things close to the data. We're building out this sandbox where we want to be very clear about what controls are being used: we're using the exact same controls the customer uses to secure data in Databricks. To the administrator, applications can be secured the same way as notebooks, dashboards, or other data assets. That's the conversation we'll be having with customers.

See also: How the Data Lakehouse Might Usurp the Warehouse and the Lake and 7 Data Lake Best Practices for Effective Data Management.

CDI: That's a powerful combination: the flexibility of whatever deployment or development tool you prefer, with the rigor of the enterprise's security standards. The press release for Lakehouse Apps mentions that your partners were essential in gathering requirements, and it also mentioned some early partners. Can you tell me about their role and what they brought to the project?

Shanku: Ultimately, we want every developer to be able to build applications on Databricks. If you look at what development teams in enterprises typically do, they don't build their applications from scratch. They use other services, and they sometimes use tool builders and other platforms to build their applications. To make Databricks a great place to build those applications, we first need to reach out to the other software vendors building applications in the ecosystem. So our approach has been to start with the partners and get them onto our platform, to make sure that as the broader base of developers comes onto Lakehouse Apps, they have a set of tools and services to work with.

I’ll give just a couple of examples. We’re launching with a company called Retool. So, Retool makes it very easy to build business applications and in-house tools on top of data. They’re a low-code, no-code software development environment with a huge amount of traction in a number of enterprises. Retool has a private version of their product that you can install yourself in a VPC if you need to run against private data. And so, we are working with them to turn Retool into a Lakehouse app so that if a company wants to use Retool together with data in their lakehouse, they’ll be able to go to the Marketplace, install Retool, and start working right away.

The second one, which is featured in the keynote, is Kumo. Kumo has a solution for building and using graph neural networks for predictive AI, so that you can write queries about the future as easily as you can write SQL queries about the past. You can point to a bunch of data and then write a query like: what will the ten most purchased products in the next 60 days be? Under the covers, Kumo trains a predictive model on that data and then provides access to it. This is a great solution for people building applications, but, again, Kumo needs the data. With Kumo built as a Lakehouse App, a developer can now go to the Marketplace, install Kumo, run it as a Lakehouse App, and then build on top of it.
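To illustrate the contrast Shanku draws, compare a SQL question about the past with the same question asked forward in time. The "future" syntax below is invented purely for illustration and is not Kumo's actual query language; the table and column names are also made up.

```python
# Purely illustrative pseudo-queries contrasting "past" SQL with the kind of
# "query about the future" described above. All names and syntax are invented.

# Past: top products by purchases over the *previous* 60 days -- plain SQL.
past_query = """
    SELECT product_id, COUNT(*) AS purchases
    FROM sales.purchases
    WHERE purchased_at >= current_date() - INTERVAL 60 DAYS
    GROUP BY product_id
    ORDER BY purchases DESC
    LIMIT 10
"""

# Future: the same question asked forward in time. A predictive system trains
# a model under the covers and answers a declaration shaped roughly like:
future_query = """
    PREDICT COUNT(purchases, NEXT 60 DAYS)
    FOR EACH product_id
    RANK TOP 10
"""
```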

CDI: That same approach might be very useful to companies who want to build a custom generative AI solution trained on their own data. That’s emerging as a dominant use case for enterprise generative AI. What are you seeing?

Shanku: That's legitimate. These days, AI, and especially generative AI, is driving an incredible amount of innovation. A lot of that is about speed, but a lot of it is also about the quality of the data and how you can use it in a secure way. Many of the applications that become Lakehouse Apps will use generative AI models under the covers. Kumo, for example, uses a predictive AI model, and another Lakehouse Apps partner offers an LLM-based code assistant. Lakehouse Apps will unlock many opportunities for people who are already doing interesting work.

See also: First Steps Toward Leveraging Enterprise ChatGPT

CDI: Is there any testing of Lakehouse Apps? Who verifies that an app works with Databricks the way it's intended to?

Shanku: First of all, a lot of it comes from the sandbox model itself. Think of the equivalent of an iPhone app: every one of these apps will run on our infrastructure, in a sandbox that we configure automatically, not the developer. When you install a Lakehouse App, as a user or administrator, you get to customize what resources that application has access to. For example, those applications cannot connect to the external internet, which means your data stays inside Databricks, so a lot of the security comes by default.

If there are additional permissions the application needs, it can surface that to the administrator, and the administrator can then decide whether to share logs with the customer or with the application vendor, and so on. The administrator is always in charge. When a developer publishes something in our Marketplace, we will, just like your iPhone marketplace, run a set of checks, such as security scans, to make sure there are no vulnerabilities. As we roll out updates, we will roll them back if there are any issues. So, through the marketplace model, we will have some additional checks to make sure that Lakehouse Apps are well-behaved, but most of the security comes from the way the application is designed.
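As a purely hypothetical sketch of that model (this is not an actual Lakehouse Apps manifest format; every field, name, and helper below is invented for illustration), an install-time permission flow might look like this:

```python
# Hypothetical sketch of the sandbox model described above: the app declares
# what it needs, outbound internet is off by default, and the administrator
# approves or narrows each request at install time. Nothing here is a real
# Lakehouse Apps API.
requested_permissions = {
    "compute": {"cluster_size": "small"},   # runs on Databricks compute
    "data": ["main.sales.orders"],          # tables the app asks to read
    "apis": ["jobs", "pipelines"],          # platform APIs it will call
    "external_network": False,              # no internet egress by default
    "share_logs_with_vendor": False,        # admin-controlled, off by default
}

def admin_review(requested: dict, approved_tables: list[str]) -> dict:
    """Illustrative install-time step: the administrator stays in charge,
    granting only the subset of data access they are comfortable with."""
    granted = dict(requested)
    granted["data"] = [t for t in requested["data"] if t in approved_tables]
    return granted

print(admin_review(requested_permissions, approved_tables=["main.sales.orders"]))
```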

CDI: What’s the business model associated with offering solutions in the Marketplace?

Shanku: I think developers and vendors tend to choose their own business model. For example, some of the companies we're talking to already have a per-user business model. That's great, because when they run as a Lakehouse App, they can use the compute that's in Databricks, and the customer is billed for that compute by Databricks as though it's just another workload. The developer can then charge a simple per-user fee and not worry about cloud consumption costs.

See also: Cloud Customer Reprioritizing Cloud Spend Rather than Cost-Cutting

What I hear from a lot of these vendors is that they don't want to be figuring out a consumption-based business model yet. So the ability to use Databricks compute, with the compute part of the bill paid by the customer to Databricks, is very helpful to those partners. We are also going to explore other business models. Many vendors are already using the Marketplace to offer free trials and other free assets, and customers can evaluate the data product directly inside Databricks before choosing to buy.

At the end of the day, Lakehouse Apps is going to make it easier for developers to reach all of those Databricks customers securely. And customers will have a huge variety of data and AI solutions to take advantage of the data in their lakehouse that they can discover through the Databricks Marketplace.

Bio: Shanku Niyogi is the Vice President of Product Management at Databricks. Shanku has over two decades of experience in product management and software development. Shanku has held senior positions at companies such as GitHub, Google, Chef Software, and Microsoft. Shanku was responsible for launching or commercializing several products, including Actions, Codespaces, and Copilot. At Databricks, Shanku is responsible for leading the product management team and growing engagement and revenue. Shanku graduated with a BMath (Honors) from the University of Waterloo.
