First Steps toward Leveraging Enterprise ChatGPT

What started as a public beta of the breakthrough solution ChatGPT has taken the world by storm. Worldwide, people (including me), some without a technical background, rushed to try it out. Perhaps it attracted such a large audience because it seemed like a grand experiment or a game. We had no sense of what it could do or how it could be used until we started “prompting.” It has taken a few months, but applications in all commercial and non-commercial spheres have emerged. And they promise to solve long-standing problems, like how to make it easier for humans to interface with information or language.

We are seeing data conferences pivot to focus on generative AI, mostly because this is the topic that technical and business audiences alike are hungry to explore.

What generative AI means for the enterprise remains to be discovered. But to those watching the technology landscape, it’s clear that technology providers are losing no time: they are putting their agility to the test to embed ChatGPT within their offerings, at least as an interface to their solutions, as Ed Thompson, Matillion’s CTO and co-founder, described.

Here is a summary of our conversation. (See Ed Thompson’s bio below.)

CDI: You mentioned that the Gartner AI & Analytics Summit was different from other large conferences–all the conversations you had with enterprise users and technology providers were about data and analytics. Were there any conversations that were surprising or particularly insightful?

Ed: It’s clear that there’s a shift coming. Everyone is finding that reality is changing a bit. Take the technology axis. Some of the advancements in AI and large language models are changing everyone’s job a little. Look at how the funding reality for growth businesses is changing. Half of the expo hall is probably made up of similarly funded growth businesses like Matillion. What that means is going to play out over the next year. Hopefully, we won’t get too many shocks but whether we do or don’t, there’s still going to be a change over the next year. After synthesizing everyone’s opinions and thoughts here, I think “business as usual” is not going to look the same in a year’s time.

CDI: LLMs and ChatGPT are making people either not sleep because they’re so excited or not sleep because they’re so worried. If you were to call out an emerging technology trend that is really going to have an impact on your company, what would it be?

Ed: That’s the big one. We’ve spent some time trying to get a couple of levels deep on them because you’ve got to cut through the enormous amount of hype. But there is definitely something new at the core. ChatGPT is very good at some things and a lot less good at others, so I think stage one is: don’t ignore it. Then start leveraging it in the ways it works really well. I was glad to hear that some of our team in marketing and sales are already using ChatGPT to proofread, to punch up emails, and to make things more readable. That’s a really good use for it, and it’s driving efficiency in the business. Similarly, over on the engineering side, we find it’s brilliant at code sets. I’ve written some code by prompting, “Please write me some unit tests around this code.” By doing the job more efficiently, it’s making code more resilient. It will be really interesting to see how that plays through. Does that change the size of the engineering team? Does it change the output productivity? Nobody knows right now.

CDI: What’s standing in the way of some organizations really leveraging LLMs and ChatGPT?

Ed: The other aspect of it, and one of the reasons why ChatGPT is very good at coding and language, is that it’s been fed with lots of training code, like the whole of the internet. For data use cases, it’s quite challenging for a company that has a particular model or a particular language where there isn’t an enormous data set to train on, just a relatively small one. If you train on a small data set, you don’t get such fantastic results on the output. Some companies have large volumes of metadata about what they do and what their customers do. They are in a much stronger position. There have been some clever moves, like Microsoft buying GitHub, which means they have access to so much data that a customer doesn’t get. Now they can feed that into the algorithm and get great results, great tools out the back of it. That’s a real challenge for smaller vendors because they can’t just go up to GitHub and get much of the world’s programming.

CDI: GitHub is still open source and an open repository, so anyone can train their model on the code, but you mentioned the metadata. It seems that it often comes down to the metadata, which is where you get those rich insights. Do you see a way to get access to the metadata for building models?

Ed: Sure, it’s easier if you own GitHub. Many organizations have been able to build up a large cache of data about what their customers are doing. They’re going to be in a better position to leverage that as training data. The final bit that’s particularly interesting for Matillion, as a player in the data integration space, is that there are lots of people who want to build their own large language models. One of our partners is Databricks, which has customers with huge amounts of data from various sources, and they’re using Databricks to build LLMs. What’s exciting for Matillion is that, as with any AI or ML, getting the data and doing the right data preparation are key. Matillion’s role is to get the data, provide access to it, and transform it into a format that is suitable for feeding an LLM.

Take the idea of scraping GitHub. The data needs a heck of a lot of transformation and integration to make it a suitable data set for training. So even if a company were to analyze this data, they would still need assistance in preparing it to become good training data. I think being a kind of data broker in this new world is a really great position.
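As a rough illustration of the kind of preparation Ed describes, the sketch below cleans a batch of scraped source files before they would be used for training. It is not Matillion's method or anyone's production pipeline: the function names `normalize` and `prepare_training_set` and the `min_lines` threshold are hypothetical, and real preparation would involve far more (license filtering, near-duplicate detection, removal of secrets, and so on).

```python
import hashlib
import re


def normalize(source: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    lines = [line.rstrip() for line in source.splitlines()]
    text = "\n".join(lines).strip()
    return re.sub(r"\n{3,}", "\n\n", text)


def prepare_training_set(raw_files: list[str], min_lines: int = 3) -> list[str]:
    """Normalize files, drop very short ones, and remove exact duplicates.

    `min_lines` is an arbitrary, illustrative threshold.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for source in raw_files:
        text = normalize(source)
        if len(text.splitlines()) < min_lines:
            continue  # too short to be a useful training example
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical to a file already kept
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```

Even this toy version shows why raw scraped data is not training data: two copies of the same file that differ only in whitespace collapse into one example once they are normalized.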

The ability to build resilient data pipelines is still critical. Even ChatGPT was two years out of date when OpenAI went live with it. They’ve updated it some since then. It goes to show that constantly feeding in fresh data is hard. Matillion is all about the constant movement of data and constant transformation, even after launch. Everyone is talking about creating data products, but few people are talking about day one, day two, day three. Updating the data is going to be absolutely essential, right? We’ve all seen how horrific the mistakes can be, or at least how frightening and insulting. The problem is that these chat algorithms can have output that’s only as good as the training data. You have to keep feeding it more data or less biased data so that the bias is written out of the equation, if you will.

CDI: You mentioned resilience. What would you say are the differentiators that Matillion brings that are really necessary for people to be able to crack this opportunity or this challenge?

Ed: Matillion’s core strength has always been data transformation. When we started using cloud data warehouse technology we did our transformations in the cloud data warehouse instead of doing it as data was moved in memory. Customers who wanted to keep an on-premises data warehouse could still use Matillion to do the transformations their business had always relied on and use their low-code tools like Informatica or Talend to maintain their data team’s productivity.

From where I’m sitting, there seems to be some tension in data teams. Our customers’ typical data team tends to have a mixture of people who come from a lab-coat background or from analytics tools like Informatica and Talend. These tend to consider the business user more. They are very data literate and understand the value of local tools getting the job done. And they’re being met by people coming from the engineering side who are driving toward engineering best practices. They want to write code and are using frameworks like dbt, or just pure SQL, or Python and SQL in notebooks and such. We are catering to both types; there is no reason why you can’t have low code and high code all orchestrated together in your data pipeline. And sometimes you can actually transition between the two. To give you a concrete example, our data productivity cloud puts source control on everything we do. That’s always been a difficult thing to do in a lab coat. We want to crack that so customers feel like they’re managing their data assets in exactly the same way as they would manage source code assets on a development team. That’s seen as the best practice, probably because it is.

CDI: Source control is probably not just hard for the lab coats on the data team to deal with. I think that it might be a foreign concept to those who create products for the business user.

Ed: If you look at data teams that are low-code only, it’s not that they don’t do it. It’s just that they’re less mature and don’t think in that way.

CDI: The business user is having so much more influence now on how data is accessed and delivered. What are you seeing with your customers?

Ed: I’ve always been surprised by how big our customers’ data estates grow. Good data and good data integration lead to good analytics, which leads to more questions, which leads to a request for more data. The cycle goes around and it matures. Despite the fact that we built on these really scalable cloud data platforms, customers were still managing to come in with scaling problems, not only as data volumes got really large but also because of the amount of analytics and data transformation they were doing. We’d expect them to be running 10 or 15 simultaneous data pipelines, but they were running 100 pipelines simultaneously.

One of the drivers for this growth of data is the number of sources that are available now. We have connectors to almost everything that our customers come to us with. For example, we have connectors to Snapchat and Instagram. I’m not sure where the business value is with those. Customers don’t always ask themselves whether the executive team really needs that data. And does it need it in real time?

When customers ask about streaming data and real-time data, I draw a chart with cost along the bottom and time on the vertical. The more speed you want, the more it costs. If you can live with data being one minute out of date, you can have that relatively affordably. If you want it in one second, that’s going to cost a lot. If you want less latency than that, it’s very, very expensive.

So that’s 100 pipelines to maybe as many sources and data formats, with various latency rates. Making sure that scales and remains resilient is not easy.
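To make the scaling and resilience concern concrete, here is a minimal sketch of running many pipelines at once while retrying transient failures and isolating permanent ones. It only assumes that each pipeline can be represented as a zero-argument Python callable; the names `run_with_retry` and `run_all` are hypothetical and do not describe how Matillion or any other product schedules work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(pipeline, attempts=3, delay_seconds=0.0):
    """Call a pipeline, retrying on any exception up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return pipeline()
        except Exception as error:
            last_error = error
            time.sleep(delay_seconds)  # simple fixed pause between attempts
    raise last_error


def run_all(pipelines, max_workers=10):
    """Run named pipelines concurrently.

    Returns a dict mapping each name to its result, or to the exception
    that remained after all retries, so one failure does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(run_with_retry, fn)
            for name, fn in pipelines.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as error:
                results[name] = error
    return results
```

Passing a dictionary such as `{"orders": load_orders, "events": load_events}` (both names illustrative) returns one entry per pipeline, which makes it easy to see which of 100 runs succeeded and which still need attention.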

Customers have been moving away from deploying infrastructure on-premises. They were super happy to get out of the data center into the cloud. Now they don’t want to run instances in the cloud. It’s expensive. Besides, they don’t want the management burden. The only thing they want to worry about is data sovereignty, which is usually solved by having a hybrid approach. In the end, they want to make sure their data pipelines run, and if they’re not running correctly, for it to not be their problem to fix.

—

Ed Thompson Bio: Ed Thompson is CTO and co-founder of Matillion. He started his career as an IBM software consultant, and spent 11 years consulting for some of the premier blue-chip companies in the UK. Along with CEO Matthew Scullion, he launched Matillion in 2011 and set about building a crack team of data integration experts and software engineers. He and his team launched Matillion’s flagship ETL product in 2014, which has driven the company’s growth ever since. Ed’s strength is his ability to bring together best-in-class technologies from across the software ecosystem and apply them to solving the deep and complex requirements of modern businesses in new and disruptive ways. He is a graduate of the University of Salford with a degree in Computer Science. A proud father of three (plus two dogs), he has recently taken up training assistance-dog puppies for blind people.