
Evolving the Enterprise
Welcome to 'Evolving the Enterprise.' A podcast that brings together thought leaders from the worlds of data, automation, AI, integration, and more. Join SnapLogic’s Chief Marketing Officer, Dayle Hall, as we delve into captivating stories of enterprise technology successes, and failures, through lively discussions with industry-leading executives and experts. Together, we'll explore the real-world challenges and opportunities that companies face as they reshape the future of work.
Data Engineering and AI: Navigating the Modern Data Landscape with Chad Sanderson, CEO & Co-Founder at Gable
Join us in this newest episode of Evolving the Enterprise as we dive into the intricate world of data engineering with the esteemed Chad Sanderson.
Chad shares his deep knowledge and experience as we work through the challenges data engineers face at different stages of a company's growth, delve into the cultural implications of technological advancements, and dissect the intricacies of low-code automation and its impact on data trust. Furthermore, Chad provides an insightful forecast on the future of generative AI and its potential to revolutionize our approach to data.
This episode is an essential listen for professionals in the tech industry, data enthusiasts, and anyone keen to understand the pivotal role of data quality in our technologically evolving world.
Sponsor
The Evolving the Enterprise podcast is brought to you by SnapLogic, the world’s first generative integration platform for faster, easier digital transformation. Whether you are automating business processes, democratizing data, or delivering digital products and services, SnapLogic enables you to simplify your technology stack and take your enterprise further. Join the Generative Integration movement at snaplogic.com.
Additional Resources
- Follow Dayle Hall on LinkedIn
- Follow Chad Sanderson on LinkedIn
- Learn about the Evolving the Enterprise Virtual Summit
- Turn ideas into production-ready workflows and data pipelines with Generative Integration
- Back to basics: What is iPaaS? | What is Data Integration? | What is Application Integration?
Data Engineering and AI: Navigating the Modern Data Landscape
Dayle Hall:
Hi, and welcome to our latest episode of our podcast. Today, we have a very special guest, Chad Sanderson, who is a data leader with an- I don't even know how much passion he has, but he has a ton of passion for applying product thinking to big, holistic data challenges. As most of our listeners know and most of you out there reading, if data is the new oil, guess what, that comes with a bunch of challenges. So we're going to address some of those today.
He's currently leading the fastest-growing data quality community on the internet. We're going to ask him about that in a second. It's called the Data Quality Camp. But he has a bunch of other interests and a bunch of other commitments, including being a scout for Sequoia Capital. So maybe we'll get some insight into what he's looking at, and what he's looking at investing in across data infrastructure and machine learning products. So a really interesting perspective there, community as well as some of that investing, Chad. So welcome to the show.
Chad Sanderson:
Yeah, thanks for having me. Great to be here.
Dayle Hall:
Yeah. Before we get going, we have a couple of questions that we're going to dig into around data governance, change management, communication, all that kind of stuff. But before we do that, give me a bit of a synopsis on who you are, how you got into this specific role. And I'd love to know a little bit more about what the Data Quality Camp actually is.
Chad Sanderson:
Yep, absolutely. So I'm Chad. I've been in data management and data infrastructure for a little over the last decade. I've worked at companies like Microsoft and Sephora, Subway and Oracle. My last position, I was leading the data team at a company called Convoy, which is a late-stage freight technology company. It's like Uber but for freight. It's a two-sided marketplace, lots of machine learning, lots of data.
I saw a bunch of interesting problems while I was at Convoy that I felt didn't have solutions. I was really struggling with applying data quality for machine learning and artificial intelligence. Everywhere I looked and everybody I asked about how to solve those problems, it was usually coming from a vendor, like a company that was selling a solution. And I wanted to learn from people. That's why I ended up starting Data Quality Camp.
It was a place where professionals who had been there and done that could share expertise with others who were trying to learn in a nonvendor-oriented way. There's definitely vendors in there, but I encourage everyone to argue from their theory and experience and practical hands-on guidance instead of making sales. If you do that, you get banned. We started the community in November. It's now been not quite a year, but we're at around 8,000 members and growing every day.
Dayle Hall:
That's incredible. It's interesting as it feels like generative AI has now reignited a lot more discussions, again, around data and data quality, and the ethics of AI and the data that goes in. What I'm excited about, you're one of the few people I've talked to recently that has started something like this, to bring in, I don't want to say average people, but to bring in people and not just brands to help understand implications to help potentially set guidelines and help guide other companies. How do you think the average person- you just said that you kick out anyone that's trying to sell anything. But why is it so important for us to be individuals in this process and not just get, I don't want to say railroaded, but not just get told how to act and what data should be used for by large companies?
Chad Sanderson:
Yeah. To be clear, I don't have a big problem with brands. The advice or the guidance I usually give them if they want to join the community is to say, look, I think that you started your company because you believe in something. You believe that whatever it is that you're selling, that was the right way of doing things. Maybe they're just a complete shill and they have no heart at all. But if that was me, that's what I would do.
So what I tell them is talk about that. Don't talk about your company and your products. Talk about the ideas and why you think this is the right way of doing things. And if other people agree with those ideas, then they'll buy your product. And that's it. And you've convinced them because of the value that you've created and not because of the marketing that you've done. That's the part that I don't really like.
To your point, the reason why I think places like Data Quality Camp are important is that it's the practitioners who are the ones on the ground. They're the ones that have to deal with all the outcomes. At the end of the day, if you go out and you buy a vendor, and you pay half a million dollars because of some promises that they make, and then two and a half or three years go by and you've got all these data issues and governance issues and you've let PII slip through, and now there's a lawsuit filed against you for $300 million, I mean, the worst thing that you can do is get rid of the vendor. You've already paid them $1.5 million, so yeah, they're probably going to be sad about not getting any more money from you, but they're going to be fine and you're going to be out however much. I think at the end of the day, it's all on us. We have to make good decisions for ourselves, and that requires being knowledgeable about the problems and the requirements to solve them.
Dayle Hall:
Well, there's clearly a need for it. You said you've had this going for about a year. There's 8,000 people that jumped in. So it feels like there's definitely a thirst out there for people to be part of this. And people clearly very much care about this type of community. Because I like what you said. It's the practitioners that have to deal with the outcomes, good or bad. So as practitioners, if we can share these ideas in a community, and hopefully, at some point, we can get better in managing data, in making sure we make the right decisions.
What has the reaction been when you've been in the community and you pose some harder questions to people? Are people nervous about getting involved, about giving you their opinion? Because in one of my previous jobs, I worked at a community company called Lithium. They sold a community platform. And one of the things that was always critical is how you get people engaged and how you drive those discussions. But this seems like such a hot topic. Do you still have to drive the discussions, or are people just, we're in, we're here, we've got so many questions? How's the vibrancy within the group?
Chad Sanderson:
Yeah. I mean, we have somewhere between, I would say, four to six pretty big discussions that happen a day. I don't do too much of anything. I used to pose a lot of the questions myself, but nowadays, I just let it organically flow. And I think you're right, data is one of those places where people are extremely opinionated. The folks in data who are interested enough in a topic to join a group like this tend to fall on one of two sides. Either they don't really have any opinion at all and they're looking to form that opinion and learn, or they're unbelievably opinionated to the point that they will have a fistfight with someone who disagrees with them. And so when you put both of those types of people in the same place, the former is going to ask a lot of questions because they're trying to learn, and the latter is going to be rabid about answering those questions.
The other interesting thing about data is that the people who are willing to come to blows often have very different perspectives. And so they will go back and forth in the comments, which I think is always super interesting because you're never just getting one side of an answer. You're always getting very opinionated, very educated, multiple sides. You're seeing the argument from multiple sides with two people who really know what they're talking about. And I think that it's better than just going and reading a blog post from one person's perspective and making all of your opinions based around that.
Dayle Hall:
Yeah, I like that. I actually like the concept of people coming there to try and actually shape their opinion, to learn more. So you don't have to have that strong opinion. Maybe you just want to learn, because I think knowledge and understanding all the perspectives is critical when it comes to data these days.
Let's move on to data within an enterprise. Obviously, a data engineer, the importance of what they do on a daily basis is critical. What kind of challenges does your average data engineer have these days outside of what you're talking about in the community, but just managing the volume, all the applications they have to connect, probably the multiple requests from the organization? How are they managing that influx within the enterprise today?
Chad Sanderson:
It's a good question. I think that it depends on what stage of the company's life cycle we're talking about; the challenge is going to be different. If you're a data engineer for a very early-stage company, you're being asked to move exceptionally fast. That means not having the time to build out robust, scalable systems. You have to be extremely scrappy, like any other start-up. But in the data world, I think it's actually a lot worse than in the software world because these systems are unbelievably dependent on each other. At least in software, you can invest in microservices and they can be relatively siloed. And you have pretty good programs of going back and managing some of the tech debt over time. In data, we don't really have that yet. And most data engineers know that. So you're just sitting on this atomic bomb of tech debt, ticking over time, and trying to deal with it as best you can.
If you're a bigger company, so if you're more of a legacy business or maybe you're an enterprise with 1,000, 2,000 engineers, something like that, you actually have a different type of problem, where it's more of an organizational complexity and change management issue. You have a relatively small team. You're not going to see more than 20 or 30 data engineers even at a company with hundreds of software engineers, maybe even thousands of software engineers. That's usually around the limit.
And so these people, this tiny group, has to deal with changes that are happening all the time from every single part of the data technology stack. And when those changes happen, they have negative impacts on other people downstream that are consuming that data because of all those dependencies that I mentioned before. And when things break because data scientists and analysts don't understand where the data is coming from or who owns it or who made the change, all of these requests filter back to the data engineers, mainly because they have the word data in their title.
While the data engineer wants to be handling a lot of the technology that you mentioned, they want to be thinking about their data warehouse and the ETL system, and the orchestration, and the monitoring, and the testing, they want to do that. That's the exciting part. What they actually spend most of their time doing is dealing with other people yelling at them because things broke.
Dayle Hall:
Yeah. It's interesting you say that, because as you were talking, I'm thinking, data engineers sound like some kind of unsung hero, that they're always there for the organization, but it's not always seen within that. Should they be spending more of their time working on the things that you talked about, like capturing the data and making sure it gets to the right place? And you mentioned that they seem to spend a lot of their time actually dealing with people. Is there a way around that? Or is it partially because a lot of the people they deal with don't fully understand the complexity of what they're dealing with?
Chad Sanderson:
Yeah. So two questions there. The first question was more about, what should they be doing? And the second question was about, okay, well, how do you deal with this people problem? I think that the data engineering role is pretty special. It's a special blend of skills where you're really a software engineer who understands, has a lot of depth when it comes to managing technical data systems. So you're thinking about data streaming technologies. You're thinking about how you model and transform data, how you orchestrate it, move it between different systems at different speeds, and then ultimately, how you feed that data into some application downstream, whether it's a machine learning model or a dashboard for visualization, or whatever it might be.
And you're really worried about the architecture. You're thinking of, what are the tools that we need? Why do we need to use those tools? When are we ready to scale up our tools? How do we actually leverage those tools in the most effective way possible? And that on its own is a really challenging problem because you're seeing new use cases emerge every day. What was acceptable yesterday may not be acceptable the next day. If you've got data that's flowing into a report the CEO looks at once a week, well, then you only need the data refreshed once a week. But if tomorrow, someone says, I want to use that data to make real-time predictions in our app, that once a week doesn't work anymore. And that's a technological shift. And the data engineer is the one who has the burden of thinking about how to do that.
To your point on the people problem and does it come from the organization, how the organization is actually set up, I think that's exactly the problem. And this is a very long- and I could talk about this for a really long time, but there's a whole history here. And it's really the history of data and software engineering and how those two disciplines have intertwined with each other over the last 20 or 30 years or so.
Back in the day, 1970s, 1980s, there was no software engineering. That wasn't really a thing, or at the very least, it was just the beginning of the twinkle in someone's eye. But data was in full bloom. That was usually the system that many companies started off with. Ford implemented a data warehouse in 1987, and they shaved a billion dollars off of their cost because they were able to see, oh, where are the bottlenecks in our manufacturing process as measured through data from all of their facilities.
And in that period of time, data was really treated as a first-class citizen at the company. You had a data architect, and that data architect sat really at the top of the organization. And they were the ones that decided, hey, software engineers, this is how you're going to construct the data models for your databases and this is how your applications are going to work. And they managed all the transformation. So they figured out how to take all that data and turn it into an analytical model that a data scientist, they weren't called that at the time but they are now, or an analyst, whoever, could look at to make decisions. And then they would actually produce that model and they would choose all the technology. So you had that sort of data flowing in from the production systems, and there's one person in the middle that was orchestrating all of it. And then it was flowing out to the data consumers who were using it.
That system was very slow. And it took a long time to ever update it. They were absolutely a bottleneck, but it worked. And then software came. And software changed everything and the cloud changed everything. You had Meta, formerly known as Facebook, and Amazon and Google. What those companies all did is they said, no, no, no, that's not the way to do things. The way to do things is to create a software engineering team and to empower the software engineer to build as many applications with as many features as you possibly can, as quickly as you can. And your goal is to experiment, see what features work and what features don't, keep the ones that work, get rid of the ones that don't, and just consistently iterate over time faster, faster, faster. That became the de facto operating model for pretty much every start-up coming out of Silicon Valley and really the rest of the world, was this very heavy software engineering-first mentality.
That is what led to the rapid adoption of the cloud. Because the whole point of the cloud is you can go faster. You can push all of your data, all your services, or your applications into one place. Microservices, same thing. Now we're not dependent on a monolith anymore. We can break all this stuff out, so people can move even faster. And all these companies switched to that model.
What does that do to the data model where you have the guy who's the bottleneck that's orchestrating everything and he's designing everything? It blows it up. It's like the Looney Tunes dynamite plunger: you press it down and it just blows everything to pieces. And that's what happened to data over the last 20 years. It's been blown up, blown to smithereens by the cloud. And we're just starting to put the pieces back together to build up the structure. But what we lack is the organizational structure that allows the data engineers to really thrive.
Dayle Hall:
Yeah. I think we’re now seeing the growth of a chief data officer-type role. But in essence, that has been because most organizations feel like we've got to get control over this. It's less about, going back to what you said before, which is building the structure in the right way, in the first place before you add more systems, more applications, and so on. And it's actually most of us, I think, are struggling.
I did a podcast earlier this week with someone else who said that- we were talking about generative AI. And he was saying that generative AI won't help you if you don't have the data architecture and the metrics system in place to be able to accurately use the data. It won't do anything for you. So I think that has- you've gone through that shift of data architects, the data engineers being the core of at least the IT organizations 20 years ago. And I feel like they're starting to have their day in the sun again because of the sheer amount of organizational data that we have.
If you're out there, if there's a data engineer out there listening to this, or other people around the organization, what are the things that you've seen that have worked around getting those data engineers, I don't want to say back with a seat at the table, but more front and center as new applications are being added to the organization? How do they get back to being in that position of power?
Chad Sanderson:
First of all, I agree with everything that you said. I actually had someone ask me a couple weeks ago how long would it be if an organization decided today, we're going to change our data, we're going to invest in data engineering, we're going to invest in data architecture because we want gen AI for our data, how long would it take until that was production ready? And I said, if you're talking about the average company, three to four years. And they didn't believe me. They were shocked. They're like, what, three to four years? No, no, I thought you were going to say something like six months. I'm like, no, you have no idea how bad the data is under the covers in most of these organizations. It's a war zone. It's a nightmare.
But to your question on what can the data engineer do to get back into the sunlight, so to speak, I think that it depends on how fast you want them to get there. There's always the fast way. And the fast way is you come in with the hammer of organizational change and you say, we are going to change the company, we're going to change how we think about product teams, we're going to change where data engineers sit in the organization. They're no longer just going to be restricted to infrastructure. We're now going to embed them in every single product team in the company. Every product team is going to have a data person in it, a data engineer in it. And that data engineer is going to be the advocate for the business.
How realistically is that going to happen? Well, at most companies, like Convoy where I was, we had probably 500 software engineers, and we had a grand total of six data engineers. And we had 25 or 30 product teams, something like that. And you would need at least two data engineers per product team. So you can do the math on that: we were at six, and we needed to hire an order of magnitude more. A lot of companies are in that position where the people just aren't there. They don't have the ability to go out and bring in that many data engineers to embed. And even if they did, they would still, I think, be running by the skin of their teeth. So I think that if you're a company like Google or Amazon, and you have the resources to do that, great, you should do that. That's fantastic.
But I think there's actually a better way and a more realistic way. It takes a little bit longer, and it's a bit more iterative, but it works. And that is something that I call bridging the gap. Maybe just a little bit of context to give your listeners if they don't already know this, what the data landscape actually looks like at any company. Generally, it's divided into two parts. On one side, you've got the production data. And this is you have your databases, really any data source that is owned usually by either a software engineer or someone on the business side. So it might be Salesforce data, it might be ERP systems, whatever. On the other side, you've got all of your analytics data. So these are your data scientists and your analysts, and maybe product managers and whoever, the people that are consuming data so that they can use it for some analytical or AI/ML-based reason.
There's a massive gulf between these two sides. Even though all of the data that's being used on the analytical layer is coming from the production systems, the two sides actually don't talk to each other all that much. And that's a function of the organizational change that I mentioned before. We want people to move so quickly that we've said, hey, data people, you don't need to have conversations with the producers anymore. You just go and take the data that you need, and you start working with it, and you produce all the visualizations that you want.
What that's led to is the producers have no idea where their data is going, or who's using it, or what they're using it for, and whether or not it's important. And so if they decided to make some operational change, they say, hey, look, I want to take the name field in my database and split it into first name and last name. And they do that, and it ends up breaking a $100 million pricing model downstream that's only prepared to take a single name column. They have no clue.
So I think if you can bridge that gap where the two sides are actually aware of each other, the producers and the application owners understand the dependencies that they have, and the people who own the data understand where their data is coming from, you can start to have conversations and begin to prioritize the data use cases right alongside the application use cases. But without bridging that gap, I don't think it's possible.
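The name-field scenario Chad describes can be sketched as a simple data-contract check. This is an illustrative sketch, not a tool mentioned in the episode; the schemas, consumer names, and the `breaking_changes` helper are all hypothetical:

```python
# Minimal data-contract check: compare a producer's proposed schema
# against the fields that downstream consumers declare they depend on.

def breaking_changes(
    proposed_schema: set[str],
    consumer_deps: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return, per consumer, the declared fields missing from the proposed schema."""
    broken = {}
    for consumer, fields in consumer_deps.items():
        missing = fields - proposed_schema
        if missing:
            broken[consumer] = missing
    return broken

# The producer splits `name` into `first_name` / `last_name`.
proposed = {"id", "first_name", "last_name", "created_at"}

# Downstream consumers and the columns they rely on (hypothetical).
deps = {
    "pricing_model": {"id", "name"},        # still expects the single column
    "weekly_report": {"id", "created_at"},  # unaffected by the split
}

print(breaking_changes(proposed, deps))
# prints {'pricing_model': {'name'}}
```

A check like this, run before a producer deploys a schema change, is one lightweight way to bridge the gap: the producer learns who depends on the `name` column before the change ships, rather than after the pricing model breaks.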
Dayle Hall:
Have you seen anyone that's- because that sounds like a big, I don't want to say it's a cultural shift, but it does feel like a lot of organizations, whilst they could mandate that you have to share more ideas, come together in different meetings, feels like a cultural change. Have you seen anyone that's been able to do that? Or are a lot of organizations too far down the path?
Chad Sanderson:
It is a cultural change, but I would say that the most exciting technology innovations are solving culture problems. I'll give you one of my favorites just as an example here, code review. If you were talking to someone pre-GitHub, and you said, you know what, it's really hard to do code review, we've got so many engineers and junior engineers that are shipping code, and they're not getting anybody to look at it, and there's no eyes on it, it's just going into production, and you said, why is this issue happening, what most people in the company would say is that's a culture issue. We just need to get better at doing code review. We need to be having more meetings. We need to be more consistent about doing these things.
But then GitHub came along, and now you have pull requests. And pull requests is an unbelievably simple and straightforward way to ensure that the people who need to be reviewing code are alerted every time something changes. It's a change management problem. In fact, all of GitHub is really just a change management problem. Version control is just change management. That's all it is.
I think that this is an area of data which is totally missing, that like GitHub, can also exist. You can also know as a software engineer when you are deploying something, am I going to be breaking anything downstream, and if so, who and what. And that person who is potentially being broken will know, oh, someone's about to make a change that's going to affect me, I should go and advocate for myself. And when people start having those conversations, you're now building the awareness on both sides.
Dayle Hall:
I really liked what you said. I just want to make sure I got it right because no one's ever expressed it this way. I don't know if you said the most exciting technology innovations are actually solving cultural challenges and cultural problems. Is that right?
Chad Sanderson:
Yeah.
Dayle Hall:
That's unique, I think, from other discussions that I've had around these kinds of podcasts because I think within tech- I'm a marketer, so obviously, I'm here to position things in a certain way. But we always think of it as we're going to solve a business process, or we're solving a technology challenge. But that's a really interesting way of looking at it, which is solving change management or a cultural problem, and how impactful could technology be if you started from that perspective.
Chad Sanderson:
Exactly. When I talk about this, I often use the example of Tesla. Even though it may not seem like it, I actually think Tesla is solving a culture problem. It's a very broad, almost universal culture problem. But if you went to people in the 1990s or 2000s, and they said, hey, why aren't you driving an electric car, it's like, well, there are electric cars, they're small. They're not really cool looking. They're not very interesting. They don't hold a charge, but they exist. And as long as we just all cared about the planet more, then yeah, maybe we wouldn't buy these big gas guzzlers, if we just cared more, if we just took more time out of our day to think about climate change.
When Tesla came along, they basically said, you don't need to give up all the cool stuff in order to care about the climate, if that's your thing. And you don't need to give up the range. You don't need to give up the fact that you have four kids and you need to be able to take them to school every morning, and you commute to work and you forgot to fill up your tank of gas on the way home and you need to be able to run and do that. That, to me, thinking about that and how human beings are using technology, how they use it to solve these problems out in society, I think is a really useful way of looking at it.
I know in my personal life, and when I'm working with enterprise tools, the things that I keep going back to over and over and over again, it's the stuff that is solving a culture problem. I use Slack, probably every day, as a communication mechanism. I remember I didn't use Slack at all before, but now talking to other people, it just becomes so easy. That's not a tech problem. There's a lot of ways that you could have implemented that that are probably a lot more sophisticated.
Dayle Hall:
Yeah, that's a good perspective. Interesting to see whether other companies will start with that as a challenge and not trying to solve necessarily a business problem or a tech challenge but start with the culture piece. Or maybe the cultural piece comes with the tech piece. It's an interesting one.
Let's move on to talk a little bit about governance. You had this concept, or the concept that I've heard is something called a data graveyard. What is a data graveyard? And is it something that is becoming a massive problem?
Chad Sanderson:
It is becoming a massive problem. Data warehouse is a very commonly cited term in the data space that is not that well understood by the modern generation of data people. And there's a bunch of reasons for that. But this is another one of those old words. It came from a guy named Bill Inmon, who invented it in the 1980s. This was an organizational technique for data. And it was all about mapping your data through code to the real-world nouns and verbs that existed in your business.
So instead of just having, oh, I've got a ton of data and some data is coming from here, and some data is coming from there, and let's build dashboards on top of it, and let's build models on top of it, he said, well, wait. Before we do that, we need to actually figure out what our business processes are. And then we need to figure out what are the core entities that our company cares about. Let's say that we're a company that does freight, like Convoy. We might have some core entities like a shipment, or a shipper, or a customer, or a carrier, or a truck, or a contract, or an auction. All of these are nouns that really exist out in the world, and they all have properties, like adjectives. There's the time of the auction. There's the final price. There's the starting bid time. There's the ending bid time. All of these are things that we would observe in the real world if we were describing the auction on paper with pen and pencil, sitting in a big room where someone is running a real-world auction. These are all important things.
And then you need to figure out the relationships between one entity and another entity. There's some relationship between the auction and the buyer at the auction. There's some relationship between the truck driver and the facility to which the truck is being delivered. Once you understand that full picture, now you can start doing analysis and asking some very interesting questions. You can understand, okay, well, who are my top buyers by region and by time? And I want to see a split-out based off the number of shipments that were taken in the past six months. You can start looking for commonalities. If you don't have this big view of the world that's mapping everything together, you're trying to solve the problem in a very isolated way and you're lacking a lot of the interesting context and information that could potentially influence that decision. So that's what the warehouse was always intended to do.
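The entity-and-relationship modeling Chad describes can be sketched in a few lines of code. All of the class and field names below are hypothetical, chosen to mirror the freight examples above: entities are the nouns, their fields are the adjective-like properties, and relationships are references between entities.

```python
from dataclasses import dataclass
from datetime import datetime

# Core business entities ("nouns") with their observable properties
# ("adjectives"). Names are illustrative, modeled on the freight examples.

@dataclass
class Carrier:
    carrier_id: str
    name: str

@dataclass
class Auction:
    auction_id: str
    start_bid_time: datetime
    end_bid_time: datetime
    final_price: float

@dataclass
class Shipment:
    shipment_id: str
    # Relationships between entities are expressed as references.
    auction: Auction
    carrier: Carrier

# With the relationships mapped, analytical questions become simple
# traversals or joins rather than isolated, piecemeal queries.
def winning_carrier(shipment: Shipment) -> str:
    return shipment.carrier.name
```

In a real warehouse this mapping would live in modeled tables and join keys rather than application classes, but the idea is the same: name the entities, their properties, and their relationships before building dashboards on top.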
But these days, a lot of companies don't really have a warehouse. They say that they do, but they actually have a lot of data coming from many different places. That data architect who was doing that creation process is gone. And so everyone's just doing their best to piece things together. And if you're living in this world of piecemeal analysis, it means that you don't necessarily have a single source of truth. So if I were to say, hey, I want to know how many active customers I have today, and I go into my analytical environment to find that, then in the data warehouse world, there's one place where I get that from, and that number is the same for everyone. In the data graveyard world, there might be 50 places. Every single person might have a different take on what that means. And that means that I don't trust any of these numbers, and I'm going to go reinvent the wheel myself. The graveyard refers to all of these queries, all of these questions that are being asked that eventually fall into decay and disrepair. They litter the analytical environment as if it were a graveyard.
Dayle Hall:
That actually sounds like something most enterprises probably have at some level.
Chad Sanderson:
Yes.
Dayle Hall:
Unless you're the perfect- but no enterprise is perfect where innovation is coming so fast. Again, what I always try and do, Chad, if someone's out there and they're faced with a similar problem, or maybe their organization is going through these kinds of questions, how do you get away from that? Because like you've said, how the data warehouse was originally set up, maybe it's not being used to that level of detail now, it's not being applied to the business processes, it's not being used with the vernacular of the business. But I guarantee that's happening in most enterprises. So how do you move forward with that? How do you try and make sure that at least the data you're capturing is the single source of truth? How do you pull that together? What's the mechanism? Is it another process? Is it, God forbid, another piece of software? How do you manage that?
Chad Sanderson:
Well, I think it's a combination of a few different things. And it all really ties back to incentive structures. What incentivizes someone to dig another hole in the graveyard is a lack of trust. If I provide untrustworthy data, and I'm an analyst, it's my head that's on the spike. So I am always going to do my own diligence. And even if that means reinventing the wheel all over again, I'm willing to do that. To solve the problem, then, we conversely have to provide trust. It should be extremely obvious what the most trustworthy source of data is. And if an analyst takes a dependency on that, and that data is wrong, it's not their fault. It's the fault of whoever said it was trustworthy. If they have the responsibility of making that data high trust, and they don't, that's their problem. It's not the problem of the analysts themselves.
So the way that we thought about this at Convoy, number one, was we had two different environments for our data. One environment was for prototyping and sandboxing. And we were replicating all of our data from our production systems into our analytical systems. And people could go nuts. They can have fun. They could play around. They could try out different things. That type of experimentation is still very necessary. In data, you're never going to get away from it. And especially in this more kind of modern software engineering, microservice-y environment, it's just par for the course.
However, if you wanted your data to be productionized, and productionized here means shareable outside of yourself. So if I have a dashboard and I want to share that dashboard with the CMO, then that means I need to ensure some level of trust and quality. We had a requirement that you needed to add a contract to your data. And that meant, here's the data that I expect, here's where that data is coming from, here's what I define as a trustworthy expectation. For example, I always expect this field to be non-null, I always expect this ID to be an eight-character string. If it's not, that means something has gone wrong somewhere in the process.
Once you have that contract in place, you are saying canonically, this is a trustworthy data asset. And so the next time that someone comes in and wants to root around in the mud and stuff like that, it's great. But if they want to create a contract on the same data, well, that contract already exists. So all you need to do is append more data to an existing contract and leverage what's already been built effectively.
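A data contract like the one Chad describes can be expressed as a small set of machine-checkable expectations. This is only a minimal sketch, not Convoy's actual implementation; the field names and rules are hypothetical, mirroring his non-null and eight-character-ID examples.

```python
# A minimal data-contract sketch: each rule is a named, machine-checkable
# expectation about a field. Field names and rules are illustrative.

CONTRACT = {
    "customer_id": [
        ("is not null", lambda v: v is not None),
        ("is an 8-character string", lambda v: isinstance(v, str) and len(v) == 8),
    ],
    "order_total": [
        ("is not null", lambda v: v is not None),
        ("is non-negative", lambda v: isinstance(v, (int, float)) and v >= 0),
    ],
}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the
    record meets every expectation and can be treated as trustworthy."""
    violations = []
    for field, rules in CONTRACT.items():
        value = record.get(field)
        for description, check in rules:
            if not check(value):
                violations.append(f"{field} {description}")
    return violations
```

The point of the pattern is exactly what Chad describes: once the expectations exist in one canonical place, the next team appends to the same contract instead of re-deriving trust from scratch.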
Dayle Hall:
Recently, there's been a lot of talk around, there's different terms for it, citizen developer, citizen integrator, low-code, automations, where people who aren't necessarily software engineers or within the IT organization or data engineers, where they don't have to have that kind of knowledge, but they still get access to all the data that they need. They don't have to go and wait for IT or the data team to give it to them. And this concept of automation, of RPA as a market, grew very quickly. Has that helped us in this situation, or has it just created, I would say, more graveyards? Or is it that if you don't have the trust behind the data in the first place, an automation hasn't fixed it; in fact, it could have exponentially made the problem worse?
Chad Sanderson:
You got it. That's the biggest and scariest thing with gen AI, by the way, is that you have all of these low-code tools that have made a lot of promises about automation, but the data that they rely on is not trustworthy at all. And it actually gets less trustworthy the further it gets from the source system where the data originates. So it's almost like a pollutant being poured into a lake at the source, and then it flows down to everybody that consumes it in the drinking water and the bath water and everything else.
That's how a lot of these low-code systems operate as well. They make the assumption that the data closest to the consumer is safe, that you can put all of these tools and widgets and gadgets around it and do all this great manipulation. Well, it's all poison. I think this is a really, really big problem. And I think it exacerbates the problem because it's so easy to go in and create all of these things. It gives people a false sense of confidence that, oh, I can go in by myself, and I can create three or four or five different dashboards and answer all of my own questions, and every single one of them is wrong. The other issue is that there's not even a way of indicating that it's wrong, because the tool has no concept of right and wrong. All it has is a concept of here's the data, do whatever you want with it. But there's no governance, there's no indication of quality, and there's no indication of trust.
I think this is actually one of the biggest problems in business today. And part of my conspiracy theory, which is not a conspiracy theory because I know a lot of people who are working at these companies on the finance teams- in fact, it was a report that I read by these guys, I think Blackstone they were called. And they said that of financial controllers, 75% of the senior-level folks said that they had good data and they trusted it. And it was like less than 30% of the actual ICs on the ground who said that they had trustworthy data. And that's for financial data. That's not for auto data, or AI or anything like that, which is even worse than financial data. So my conspiracy is that over the next 10 years, we're going to see string after string after string of high-profile fraud cases starting to come out with IPO-ed companies where the numbers they have been reporting to their shareholders are not reality.
Dayle Hall:
Wow. I like that. It's a really interesting hot take. Because one of the things that I think, you mentioned this yourself just now, which is generative integration- sorry, we talked about generative integration- generative AI. I think very quickly, there's been an expectation, and I've heard a bunch of people ask this, C-level execs, could be the CEO, asking their organization, what are we doing around gen AI? They don't necessarily understand the things that you just talked about, which is trust in the data of where it's coming from. They just want to do something because it's the hot innovation, because they think it's going to give them massive advantage.
What do you think the appetite is for senior execs to listen, to hear the team that really knows about this, whether it's data engineers, whether it’s the chief data officer? What do you think the appetite is for them to really think about, we can take advantage of this, but there's some fundamental things we have to have in our architecture around the data first? Because it feels like a lot of organizations are going to rush into any gen AI solution because they think it's going to do this. And fraud cases aside, I think there's going to be a lot of, I don't want to say wasted investment, but there's going to be some disappointed execs out there that are going to invest and not get what they think they're going to get.
Chad Sanderson:
Oh, absolutely. And this is part of every AI hype cycle. You've been around the industry for a long time, so I'm sure that you've seen it, too, where there were neural networks at one point. Every few years, there's an AI hype cycle and it's supposed to be the thing that changes everything. And then it turns out, oh, actually, having the data that powers these models is a lot harder than building the models themselves. And the investment to pull that off- that's actually the moat that Google and all these other companies have. They have a data moat. It's not that they have a model moat. That's number one: I think there's always going to be a sort of trough of disillusionment where executives run headfirst into the brick wall of reality, and they just don't have the data to support the things they want to do.
However, that being said, I do think that there is much more of an appetite recently that I'm hearing from executives who are open to fixing some of their core issues in order to get gen AI going. And I think the reason for that is, honestly, data had lost its way for a little bit; for a few decades, it was wandering around in the wilderness like Moses. And that's because, originally, you didn't invest in data unless you had a really good reason to do that. And that's because storing and computing all of that data on a server, in a physical space, was unbelievably expensive. And in order for you to pay that cost, you needed to have a really good reason to do it. And so it was very use case driven through the 1970s, the 1980s, and a lot of the 1990s.
But then when the separation of compute and storage happened, it became unbelievably cheap to store data. And so what all of the application teams started to do is say, we're just going to dump all the data that we're getting from our application. We don't even care if it's important or not. We're just going to put all of it into this thing called a data lake, where maybe we'll get some use out of it later. And then we said, well, why don't we just start pulling all of that data into our analytical environment so that we have access to it. And maybe we need it now, maybe we don't, but we can start doing some cool things on it, and we'll figure it out as time goes along.
And so when the cloud- honestly, from the time the cloud really began to today, no one really had a use case-driven idea of what they wanted to do with their data. They just knew that they had it and they wanted to make it available to people. So you saw a lot of dashboarding. If you look at the venture space, most of the big data companies are either tools to move data around, get it from place to place, tools to store it, or tools to do very basic analysis and visualization off of it, like Tableau and Looker and these types of things.
The executives up until now, I think it's not that they hated data. They just looked at what people in their company were doing with the data and saying, okay, well, we're making some dashboards over here. And then on the other side, I have my software engineering team who's building the core functionality of my company that allows me to walk into the offices of Ford and Disney and sign a million dollar contract. Where am I going to put my resources? Duh, that's not a hard question to answer. But now that's changing. Now gen AI is starting to say, oh, wait a second, all this data stuff that you didn't really have a use for before, that was just going into dashboards, now you can start making money off of it. Now it's beginning to become, or at least it has the potential to become just as much of a moneymaker as some of those features that your software engineering team is working on. And so once the dollar signs start popping up in these guys’ heads, now the infrastructure investment, in order to get their data in the right shape, starts to make a little bit more sense.
Dayle Hall:
Yeah. That's usually what happens, is that if eventually someone sees an opportunity to make money, they'll invest in the infrastructure behind it if they see a path to that. Going back to the automation piece, and again, we talked a little bit about low code, no code. And again, SnapLogic is very close to these challenges, particularly around integration. And there's a lot of talk about self-service. But I want to come back to what you said around trust in the data. And some of the automation is always going to fail because of where the data is and the trust in it.
It's kind of a random question. Is self-service just a pipe dream? Is it actually impossible to get there? Are there aspects of automation that actually work, in your experience, for the customers, the clients, the organizations you talk to? How would you advise an enterprise to really think about what they should start automating, potentially with the data they have today, before they go and create better structures or clean it up? What is something- I don't want to say simple. What is a good starting point for a self-service automation-type process in an enterprise?
Chad Sanderson:
Yeah, it's a good question. I think when people think about self-service, the big question they need to ask themselves is what do we expect people to serve exactly? If the answer is someone can go and figure out how many sales we've made over the past 30 days, that's very reasonable. I think that you can get away- a lot of companies have things like that today. You want to do something that's a lot more complicated than that. And you want to cut the data six different ways. And you want to understand the root cause analysis. It was a big thing in Convoy where we would give data to our customers who are all big shippers, like Walmart and Target and these types of things. And those customers wanted to know, for their shipping facilities that they maintained, where are their bottlenecks? Where is the delivery process slowing down so that then they could go in and optimize that. And we have all this data, so we could give it to them in such a way that they could slice it and dice it and figure out all this stuff.
If you want to start answering those types of questions, that's a lot harder. That's not just a simple SQL query that you can learn in 30 minutes of watching a YouTube video. That's years of experience, understanding all of the intricacies of the code. You need to understand where the data is coming from and all of the gotchas and the history, and why this thousand-line SQL file is working the way that it's working. The more complex the questions you actually need to answer, the more context about the data you have to have. And there are multiple orders of magnitude of difficulty between the first thing and the second thing.
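To make the contrast concrete, the "easy" end of the self-service spectrum, sales over the past 30 days, really is a one-line query. This sketch uses an in-memory SQLite table; the table and column names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

# The "easy" self-service question: total sales over the past 30 days.
# Table and column names are illustrative. Dates are stored as ISO-8601
# text, so lexicographic comparison matches chronological order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")

today = date.today()
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ((today - timedelta(days=5)).isoformat(), 100.0),   # inside the window
        ((today - timedelta(days=45)).isoformat(), 250.0),  # outside the window
    ],
)

cutoff = (today - timedelta(days=30)).isoformat()
(total,) = conn.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE sale_date >= ?",
    [cutoff],
).fetchone()
```

The bottleneck-analysis questions Chad describes are a different animal entirely: multi-way joins across entities, knowledge of how each source system produces its data, and all the historical context behind the transformations.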
The problem is that when you say that to an executive, it's not overwhelmingly clear why that is. It reminds me of an old xkcd comic, where they said, figuring out whether or not you're in a national park, easy. I can get that done in a couple of days. Figuring out if a photo of something is a bird, I'm going to need a five-person research team and three years. To someone who's in the field, it's very obvious why that's the case, but for an outsider looking in, or if you're an executive, you might be very confused by that. I think that's a part of the problem.
Dayle Hall:
Yeah. As we come towards the end of the podcast, I want to just go back to the data engineer, because I think that's something- I don't know if you watch any of the superhero movies, but in one of the Batman movies, there's a bit where Harvey Dent is talking to Bruce Wayne at dinner, and he says, you either die a hero or you live long enough to see yourself become the villain. And I feel like that's what we've been describing around this role in data engineering. They're actually doing so much that people don't see. But when something goes wrong, they take the blame. In that organization you mentioned, with 500 software engineers and six data engineers, the emphasis and the requirement on them is significantly massive.
Do you see anything coming? Is there anything you're excited about, or any guidance that you would have for a data engineer on how to remain the hero and not become the villain? What would you recommend to someone in that role? What are the fundamental things they have to make sure they do, beyond whatever they have to do with their own culture and the technology? How should they think about their role, and what opportunities are coming for them?
Chad Sanderson:
What I would say is right now, the data engineer is still a centralized position in a decentralized world. You've got software engineers that are totally decentralized, and they're basically acting all on their own. But all of the pressure to maintain this ever-changing universe that happens in the production system falls onto, as you say, a very small group, a kind of band of heroes that's providing all this value for everyone downstream of them. I think that doesn't work. That can't scale.
I think that the way that data ultimately needs to be treated is the same way that product needs to be treated, where the people who want something, like the product manager who's creating the requirements and the spec, is communicating directly to the software engineer who's going to implement that spec. There is no one in the middle of that process. It's like looking at the requirements and figuring it out. The product teams are able to interact directly with other product teams. And I think that has to be the way forward with data. If you're a data scientist or you're an analyst, and something goes wrong with your data, you have to be able to communicate directly to the software engineer who caused that problem in the first place and vice versa.
So what I think data engineers should focus on is the mechanisms of how you do that. How are you able to facilitate that type of cross-team communication at scale? And I think there are some solutions out there to make it easier. But once you're in a world where these two groups are starting to talk to each other and becoming more aware of each other on a regular basis, that means the data engineer can focus on the thing that really makes them the hero, which is making sure that the infrastructure runs successfully.
Dayle Hall:
Yeah. That's a great way to end. I have one final question, we'll give you the last word. Obviously, with your role and the organizations you've worked in, you're in touch with a lot of new tech innovation, you've seen a lot. And obviously, now we have this gen AI taking off, and we'll see whether it becomes what it's promised to be moving forward, and all the challenges we're going to have with it. Is there something that you have seen or something that you've heard, or what's the most exciting thing that you're looking forward to in the next, I don't want to say two or three years because tech moves so fast, but in the next year or so? Is there something that you can see happening, or something around gen AI, or something that says, I'm really excited to see this because I think this is going to change the way we build software or the way enterprises run? What's that one thing that the listeners would take away?
Chad Sanderson:
This is going to sound boring, but I'm very much a foundations guy. I think if you have the foundations, it unlocks so much for the future. The promise- let me rephrase this. The reason that a company like GitHub can launch Copilot is because they have access to a tremendous repository of information about code, and they have a mapping through documentation of what that code is actually doing. That is like the process of creating context for an AI system. You have an object, but without any context, that object is meaningless. It's just noise. But you also have something that reflects what that underlying meaning is. And once you have those two things paired together, now you can turn over a tool to a software engineer, and as they’re writing code, the machine can say, oh, I understand what you're doing because I've seen this before, and I know what this operation is, and here's what I think you would like to do.
In the data world, one of the reasons that we can't do that as easily is because it's not so much about the code. It's about the metadata. Say you're trying to build with an AI system that is assisting you; your AI is essentially helping you build a machine learning model, which I think is a holy grail for data science. That model needs to understand: what is all the data at your company? How does it interconnect? What does it mean?
And so when you say, hey, I would like to train my machine learning model on the number of times that a customer has logged into our platform over the last 30 days, ideally, what we would want this Copilot to do is say, oh, well, I know where that data is. I know exactly what you're talking about. Here it is. And I can help turn it into a feature on your behalf. I think that's the type of thing that's going to take AI development into the next stratosphere. It's going to make every single person who works with SQL into an AI developer. And I think it's totally possible to do.
Dayle Hall:
Yeah. As always, context is key, whether it's within data or with AI. Chad, I really appreciate the time. This was a great discussion. It's unlike any of the other podcasts I've done before, which is excellent because it will keep people enthralled. Thank you so much for your time.
Chad Sanderson:
Thank you. It's been a pleasure being here.
Dayle Hall:
Great. Thanks, everyone. We appreciate you listening, and please join us on the next one, and we'll speak to you next time.