Episode 522: Noah Gift on MLOps : Software Engineering Radio

Noah Gift, author of Practical MLOps, discusses tools and techniques used to operationalize machine learning applications. Host Akshay Manchale speaks with him about the foundational aspects of MLOps, such as basic automation through DevOps, as well as the data operations and platform operations needed for building and operating machine learning applications at different levels of scale. Noah discusses using the cloud for quick experimentation with models and the importance of CI/CD and monitoring to continuously improve and keep checks on machine learning model accuracy. They also explore the regulatory and ethical considerations that are important in building useful machine learning applications at scale.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Akshay Manchale 00:00:16 Welcome to Software Engineering Radio. I’m your host, Akshay Manchale. My guest today is Noah Gift, and we’ll be talking about MLOps. Noah Gift is an executive in residence at the Duke MIDS Data Science and AI Product Innovation programs and teaches MLOps, Data Engineering, Cloud Computing, and Entrepreneurship. He’s the author of multiple technical publications, including the recent books Practical MLOps, which this episode will get into, and Python for DevOps, among others. Noah is also the founder of Pragmatic AI Labs, which develops technical content around MLOps, DevOps, data science, and Cloud Computing. Noah, welcome to the show.

Noah Gift 00:00:53 Hi, happy to be here.

Akshay Manchale 00:00:55 So to set the context for the rest of our episode, can you briefly describe what MLOps is?

Noah Gift 00:01:02 Yeah, I would describe MLOps as a combination of four different items. One would be DevOps. I would say that’s about 25% of it. The other 25% would be data engineering or DataOps. The other 25% would be modeling. So things like you do on Kaggle and then the other 25% would be business — so, product management, essentially knowing what it is you’re solving. I would describe it as a combination of those four things.

Akshay Manchale 00:01:34 And how do you see that differ from DevOps in general? Because you said DevOps was like a part of it. So where’s the difference beyond DevOps there?

Noah Gift 00:01:44 Yeah. So in terms of DevOps, really the concept is fairly straightforward. It’s the idea of automating your software infrastructure so that you’re able to rapidly release changes. You’re building evolutionary architecture and you’re able to use the Cloud, for example, to do infrastructure as code and to use virtualization. So really it’s the idea of having an iterative, agile environment where there are very few manual components. And I think many organizations understand that and they’re doing DevOps. I mean, it took a while for organizations to fully adopt it, but many people are doing this. But in terms of machine learning operations, there’s a few wild cards here. And one of them is that if you don’t have data, it’s very difficult to do machine learning operations. So you need to have some kind of a pipeline for data. And I would compare this a lot to the water system in a city, where you can’t have a dishwasher or a washing machine or a swimming pool if you don’t have a water hookup, and treatment plants where, once something’s been done with the water, you’re able to process it.

Noah Gift 00:03:00 And if you don’t have that data pipeline set up, you’re not going to be able to do a lot. And then likewise, what’s a little bit different versus DevOps is that there are new things. So if it’s just DevOps, you could be, I don’t know, deploying mobile applications. And there are some interesting things about that, but it’s fairly well known now. But with machine learning, you’re going to deal with things like models, and the models could introduce another component that basically has to be watched. So for example, is the model accurately performing in production? Has the data changed a lot since the last time you trained the model? And so you have to add new characteristics. So in some sense, there’s a lot of similarity to DevOps, but the main thing is that there are new components that have to be treated in a similar fashion as what you’ve done in the past.

Noah Gift 00:03:54 I think in some sense, like going from web development to mobile development, there could be some similarity there, in that if anyone remembers, when you first got into web development, there are kind of the classic things: there’s JavaScript and HTML and a relational database. But then when you get into mobile, it’s like, oh, wow, there’s a new thing. Now we have to write Swift code or Objective-C code, or we have to use Android. And then I have to deal with different things, like how do I deploy to my mobile device? And so in some sense, it’s just another component, but it has to be treated in a unique way: the properties of that component have to be respected and taken care of, and they’re a little bit different. Just like web development has some similarity to mobile development, but it’s not the same. There are some very unique differences.

Akshay Manchale 00:04:44 Right. In your book, you talk about how reaching the true potential of machine learning is dependent on a couple of fundamental things being present already. And you compare this with Maslow’s hierarchy of needs: in order for humans or anyone to reach full potential, you need food, water, safety, and so on, up until the full potential is really at the top of that pyramid, so to speak. So what is this hierarchy of needs for machine learning to be successful? What are those layers that build up to a successful machine learning organization or product?

Noah Gift 00:05:16 Yeah, so I would say to start with, the foundational layer is DevOps. And I think if your company is already in the software space doing, let’s say, software as a service, it’s very likely that your company has very strong DevOps capabilities. For one, you probably won’t, well, you wouldn’t have survived if you didn’t have DevOps capabilities. When I was first working in the software industry in the Bay Area, many of the companies I went to did not have DevOps, and that’s what I helped them implement. And it really is a big problem to not have DevOps. Now, if you’re in the data science world or coming from academics, DevOps may be something you really don’t have any familiarity with. And so in that scenario, if you’re at a startup and everybody is just from university and they’re used to using Jupyter notebooks, they could be in for a rude surprise in the fact that they need to implement DevOps. And DevOps, again, is automation, testing, continuous integration, continuous delivery, using Cloud Computing, using microservices.

Noah Gift 00:06:22 If you don’t have those capabilities already in your organization, you’re really going to need to build those. So that is the foundational layer. As I mentioned, depending on where you’re coming from, you may already have it. Now, the next layer would be, if you’re a software engineering shop, it’s possible that even though you’re really good at software engineering, you may not be good at the next layer, which would be the data engineering. And so, building a data pipeline. And so now you may need to build a new capability, and the new capability would be to move the data into the locations it needs to move to, and make sure that you’re able to automatically handle the different processes that prepare the data for machine learning. I think what we’re seeing right now in the MLOps space is that many organizations are using something called a feature store.

Noah Gift 00:07:09 And that’s a data engineering best practice for MLOps, and many companies are now coming out with platforms that have feature stores. I know that Snowflake, which is a big data management tool that’s publicly traded, has implemented a feature store by buying a company that had that capability. I know Databricks, a $10 billion company, they just implemented a feature store. SageMaker, one of the biggest MLOps platforms, has introduced a feature store. Iguazio is a company that I’m an advisor to; they use a feature store. So basically, that’s the next evolution: use the right tools for the job. Use data management processes, use the new systems that are being developed. Assuming you have that, then the next layer up would be the platform automation. And this is where I think it’s very easy for the data scientist to get themselves into trouble, where maybe the software engineer would be a little better at understanding that, yeah, you do need to use a platform.

Noah Gift 00:08:08 Like if you take the C# developer who has been developing .NET for 10 years or 20 years, they understand you need a platform. They have Visual Studio, they have .NET. They have all these really awesome tools. And like, why would they not use all those tools? They make them more productive. And similarly with doing things in machine learning, my recommendation is that somebody picks a platform of some kind. It could be SageMaker for AWS. It could be Azure ML Studio for Azure. It could be Databricks, if you want to do Spark-based systems. Whatever it is you’re deciding to pick, I’m more neutral on this, but you should use some platform so that you can focus on solving the whole problem holistically, versus building out orchestration systems and distributed computing systems and monitoring systems and all these things that have nothing to do with MLOps by itself.

Noah Gift 00:09:03 So once you’ve got all that and you are using some platform, then at that point, I do believe you’re at the stage where MLOps is possible. The one last step though, would be that you need to make sure that there’s a good feedback loop with the stakeholders in your organization, like the product managers, the CEO, so that you’re able to formulate what it is you’re trying to build. So in this sense, it’s not that different than regular software engineering. I’ve made a lot of new products in my life. And one of the things that’s really critical is to work with the product managers to make sure that the thing you’re building actually makes sense. Like, is there ROI, can it make money? Can it solve problems for customers? So similarly, even though you can build something, just because you have the capabilities and you’ve done all the steps doesn’t mean necessarily you should without doing a little bit of due diligence, but yeah, that would be the foundation.

Akshay Manchale 00:09:56 Yeah. And I think when you mentioned feature stores, I want to add for our listeners, we did a recent episode on feature stores. I’ll leave a link to that in the show notes if you want to go and listen to that. But continuing on with what you were saying, there’s a lot of different people involved in machine learning that you don’t normally see in just a traditional software shop that has some sort of DevOps thing in place. For example, maybe you’re working on a product that is in the healthcare space, and you’re working with, say, radiologists who are reading x-rays and they’re contributing to your machine learning model or how you go about building machine learning. So what are the challenges when that sort of different people with different skill sets and different backgrounds come in to build machine learning applications? What are the kinds of challenges that you run into when you have these diverse sets of people working on machine learning products, which I think is increasingly common?

Noah Gift 00:10:52 Yeah. I think one of the problems is that there needs to be a production first mindset and that alone could solve a lot of issues. So if from the very beginning you’re using version control, you’re using continuous integration, you’re using a platform. I think all of those are some of the ways to add guard rails to the process. If from the very beginning, you have some people that have PhDs and they’re in the corner working with Jupyter notebook, and then you have some other people that are doing DevOps and using infrastructure as code. Then that definitely is going to cause a conflict at some point. It really has to be from the very beginning that you’re using this production first mindset. Now we’re seeing this actually with a lot of the evolution of the tooling. And I know SageMaker, I was just reading today, in fact that they have this whole concept of SageMaker projects and you build out the whole project as like a machine learning software engineering project.

Noah Gift 00:11:51 So I think those are some of the things that would go a long way: making sure that you’re treating it holistically, like you would treat something that’s going to go to production. So like, no software engineer would basically just start writing code. I mean, if you’re really a beginner and you’ve never had any experience, you would just start writing code without version control or tests or anything like that, or like some kind of editor. But if you’re a professional, you would never do that. You would make sure that it was hooked up and you could continuously deploy your software. So similarly, from the very beginning, you should not make a mess. You should build out a production-first mindset.

Akshay Manchale 00:12:28 Yeah. Can you comment a little more about the continuous integration aspect of it? I know there’s various layers in terms of, say, how your data interacts with it, but just in terms of just the model, which changes over time, it might be a statistical representation of signals that you’ve trained in the past and now you want to continuously improve. Maybe you want to go back to some version of the model. So how is that represented? How do you have version control and continuous integration on models itself?

Noah Gift 00:12:56 I would say the software part is the part where the continuous integration applies: even though it’s a machine learning product, it doesn’t mean that the software went away. So the software still has to be tested and you still have to have linting and things like that. That’s where I was more referring to the continuous integration: regardless, there’ll be some microservice that’s going to be built, and it’ll have to have a model in there. Now, the stuff you bring up about model versioning. Well, in that case, I think the scenario would be that you would just — like you would with any other kind of versioning system, like a Python package — pin the model version alongside the microservice, maybe build out a Docker container, and then potentially do some kind of integration test before you put that into production.

Noah Gift 00:13:45 That’s probably the approach I would use, which is you would merge this: pin the version number for the libraries, pin the version number for the model, and maybe even the version of the data, and then push that into, let’s say, a staging branch by merging from the development branch to the staging branch, and then doing some kind of, maybe a load test to verify that inference works at scale. And then also doing some kind of performance test that says, ‘okay, here’s the accuracy we would expect’ with some validation data. So you could do some of the same things that you would do with a regular software engineering project, but the functional tests are slightly different just in the fact that they’re also validating the accuracy of the model when it goes into production, which isn’t that dissimilar to some tests that would test the business logic.
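
To make the kind of gate Noah describes a bit more concrete, here is a minimal Python sketch of a CI-style functional test, assuming a scikit-learn style model. The pinned versions, the 0.90 accuracy floor, and the toy model trained in place of a pulled artifact are all illustrative placeholders, not details from his actual pipeline.

```python
# Rough sketch of a pre-promotion gate: pin versions, then fail the build
# if the candidate model's accuracy on validation data drops below a floor.
import json

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical pins; in a real pipeline these would live with the
# microservice (requirements file, model registry tag, data snapshot id).
PINS = {"model_version": "1.4.2", "data_version": "2022-05-01", "sklearn": "1.0.2"}
ACCURACY_FLOOR = 0.90  # illustrative acceptance threshold


def load_candidate():
    """Stand-in for pulling the pinned model artifact; trains a toy model."""
    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, X_val, y_val


def test_model_meets_accuracy_floor():
    """Functional test: block the merge to staging if accuracy regresses."""
    model, X_val, y_val = load_candidate()
    acc = accuracy_score(y_val, model.predict(X_val))
    assert acc >= ACCURACY_FLOOR, f"accuracy {acc:.3f} below floor {ACCURACY_FLOOR}"


if __name__ == "__main__":
    test_model_meets_accuracy_floor()
    print("validation gate passed; pins:", json.dumps(PINS))
```

A load test against the packaged container would typically run alongside this, but that part is tooling-specific and omitted here.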

Akshay Manchale 00:14:39 Data is really at the center of the model itself. Like, you have data that’s present in the company that you access to pull signals from; maybe there’s data based on your interaction right now that comes into your model as an input signal. How do you reproduce your tests? When I build some sort of model right now, and I think the accuracy for that is, say, 60%, that depends on having some static data right now, and that underlying data might change over time. So in the MLOps world, how do you plan for keeping tests that are reproducible, that you can actually rely on over time, as you change things with respect to, say, the data pipelines, or even with respect to the model representation?

Noah Gift 00:15:25 I think there’s a lot of different ways that you could do that. One is that you could do data drift detection. So if, since the last time you trained your model, the data had maybe drifted more than 10%, then potentially what you would do is just automatically trigger a new build of the model. And then you could do your integration test that verifies that the performance of the newly trained model still works pretty well. In addition to that, you could also, and I think this is more of a newer style, keep versioned copies of your data. So if you are using, let’s say, a feature store, for example, that would be much easier to do data versioning with, right? Because you’re actually versioning the features. And then you could say, well, at this point in time, this is what our accuracy was.

Noah Gift 00:16:16 Let’s go to the new version of the features and then let’s train a new model and see, is this better? And then you could even go back and you could mix and match. So I think this is where the feature store really could be a very interesting component of a pipeline, where you’re sifting the data to the point where it becomes more like something that you would keep in a versioned manner, so that you can do things like retrain rapidly and verify that the accuracy is still good enough.
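
As a rough illustration of the drift check Noah mentions, here is a small Python sketch, assuming you have a snapshot of the training features and a sample of live features. The 10% threshold and the simple mean-shift metric are illustrative choices; real systems often use statistical tests or population stability indexes instead.

```python
# Minimal drift check: compare live feature distributions against the
# training snapshot and flag a retrain when any column moves too far.
import numpy as np


def relative_drift(train_col: np.ndarray, live_col: np.ndarray) -> float:
    """Relative change in the mean, a crude single-feature drift score."""
    baseline = abs(train_col.mean()) + 1e-9  # avoid division by zero
    return float(abs(live_col.mean() - train_col.mean()) / baseline)


def needs_retrain(train: np.ndarray, live: np.ndarray, threshold: float = 0.10) -> bool:
    """True if any feature column drifts past the threshold."""
    scores = [relative_drift(train[:, i], live[:, i]) for i in range(train.shape[1])]
    return max(scores) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(loc=1.0, scale=0.2, size=(1000, 3))
    live = rng.normal(loc=1.3, scale=0.2, size=(1000, 3))  # simulated shift
    print("trigger retrain:", needs_retrain(train, live))
```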

Akshay Manchale 00:16:50 What are some reasons why your accuracy might go down over time? Do you have any examples maybe?

Noah Gift 00:16:57 One example I had was when I was working at a sports social media company where I was the CTO. This was 2013, and it’s actually amazing how much the world has changed with social media in the last 10 years, but a lot of the issues that we’re seeing today we actually saw in social media at the time. Like, one of the issues is actually who is influential. And I think a couple of days ago Elon Musk was asking, are there bots on Twitter? Who’s really got followers? These are questions that we were dealing with 10 years ago. And one of the things that we discovered was that relative engagement was one of the stronger signals for influence. And what we did was train models that would look at the relative engagement. But when we were initially training our models to figure out who to partner with — which was one of the machine learning jobs that I developed — initially we didn’t have a ton of data, because in order for us to figure out the signal we needed to first capture their relative engagement on multiple social media platforms: Twitter, Facebook, and we even used Wikipedia for this.

Noah Gift 00:18:16 In addition to that, we also needed to have actual data. And so it’s the whole cold start problem. So once they posted content onto our platform, then we were able to get some data, but until we had that data, we had essentially a very, very small data set. And that’s a perfect example where, when I first created the model, it was a lot different than the model when there was a lot of data. Because, and this is now pretty intuitive to everybody, there’s a massive exponential relationship between somebody who’s just a regular person and, let’s say, Ronaldo or Beyonce or somebody like that. They’re so far above that there needs to be like a power-law relationship. And so if initially your model is predicting, let’s say, more of a linear relationship because you just don’t have a lot of data, and you just keep staying with that, then that could be a real problem, because your accuracy is going to be very, very different as more and more data populates in.

Noah Gift 00:19:13 So that’s the perfect example of the data drift problem: hey, for the first group of people, maybe they weren’t huge influencers, and the model was okay. But then all of a sudden, as we started to get some of these superstars that came onto our platform, we needed to basically retrain the model, because the model just didn’t even work according to the new data that it saw.

Akshay Manchale 00:19:44 That seems like there’s an urgency problem there, where you detect some sort of data drift and your model accuracy is degrading, and you really need to respond to that quickly, but training a model might take a while. So what are some backstops that you might have to, say, stick with the accuracy, maybe, or segment your users in a way where you get the same accuracy? In the example that you were talking about, are there strategies to respond really quickly in the MLOps life cycle that let you rapidly release something, rapidly release a fix, or rapidly, say, cut off access to some data that might be corrupting your model?

Noah Gift 00:20:24 I think it depends on a few different factors. So one would be, in our case, we had a very static model creation system. The models would basically be retrained every night. So it wasn’t super sophisticated. I mean, again, 2013 was like the stone age of some of the stuff that’s happening with MLOps, but we would recreate a new model every night. And when you have versioned models, you can always just go back in time and use a previous model that would’ve been more accurate. The other thing you could do is not use the newer model, or not make decisions on the newer model, so you sort of stay with the older model. So for example, in our situation, the reason why the model was so important was we used it to pay people. And so we were essentially figuring out who would be successful.

Noah Gift 00:21:19 And it was actually a way to bypass traditional advertising to grow our platform. And in fact, it was very effective. A lot of people waste a lot of money buying ads on their platform to do user growth. But we actually just went straight to influencers, figured out how much we should pay them, and then had them create content for our platform. And in that scenario, once we got into a very new set of users, where our model really didn’t understand yet how to interact with them, probably the best way to approach that would be to not let the model make any predictions, but to do more of like a naive forecast. So you could just say, look, I’m going to pay you, I don’t know, $500, versus I’m going to try to predict what to pay you.

Noah Gift 00:22:12 You just pay somebody a flat rate, like maybe the average you pay all of the people that you’re paying, so that you can collect some data. So in that kind of scenario, I think it’s important to not get too confident and say, oh great, we have this model that’s working so amazingly, and then all of a sudden you get new signals that you really don’t know how to interpret yet. Especially if there’s money involved or human life involved, it may be better to just take a very cautious approach, which is again like, hey, we’ll give you just this fixed amount of money to see what happens. And then later, maybe a year later, you can actually create a model. So I think that might be the way that I would approach one of those kinds of problems: use an old model, and then don’t make decisions on the new data until you have more data.
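
Here is a small Python sketch of that cautious fallback, assuming the decision is a payout amount: if an incoming user’s numbers fall outside anything in the training data, skip the model and pay a flat rate. The flat rate, the feature ranges, and the stand-in prediction function are all made up for illustration; this is not his actual system.

```python
# Fallback strategy: only trust the model inside the range of data it has
# seen; otherwise use a naive flat-rate decision until more data arrives.
import numpy as np

FLAT_RATE = 500.0  # hypothetical payout for users the model doesn't understand yet


def decide_payout(predict, features: np.ndarray, train_features: np.ndarray) -> float:
    """Use the model only when the features fall inside the training range."""
    lo, hi = train_features.min(axis=0), train_features.max(axis=0)
    if np.all((features >= lo) & (features <= hi)):
        return float(predict(features.reshape(1, -1))[0])
    return FLAT_RATE


def fake_model(X: np.ndarray) -> np.ndarray:
    """Stand-in prediction function for the example below."""
    return X.sum(axis=1) * 0.001


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    train = rng.uniform(0, 1_000, size=(500, 2))          # engagement-like features
    known_user = np.array([400.0, 250.0])                 # inside the training range
    superstar = np.array([5_000_000.0, 2_000_000.0])      # far outside it
    print(decide_payout(fake_model, known_user, train))   # model prediction
    print(decide_payout(fake_model, superstar, train))    # falls back to FLAT_RATE
```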

Akshay Manchale 00:22:58 With respect to just testing and deployment, A/B testing is a popular way to deploy new features to your production users. When it comes to machine learning, do you have similar patterns? I know what you just described is arguably a form of, say, A/B testing: you have one model out there and with the other one you’re just observing how it does. But are there other strategies for testing to see how well models are going to behave as you make changes to them?

Noah Gift 00:23:25 I mean, I think the A/B testing strategy is a pretty good strategy. You could also do a percentage, though. You could do A/B testing where the weight of the new model is very low, which I think, if there’s money or human life at stake, might be a good strategy, right? It’s like, why rush into things? Maybe what you do is you just throw two or three or four models out. And maybe the primary model is still at 95%, and then there are four other models that each get 1% of the traffic, and you just collect the data to see how they’re performing. And then if one of them does appear over time to be an improvement, and you’re able to figure out why it’s an improvement, then you can promote that model and degrade the other models.
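
A minimal sketch of that weighted rollout in Python, with made-up model names: the incumbent keeps almost all of the traffic and each challenger gets a sliver, so you can compare their logged performance before promoting anything. random.choices treats the weights as relative, so they match the 95%-plus-slivers split without needing to sum to one.

```python
# Weighted traffic split: route most requests to the champion model and a
# small share to each challenger, then compare logged outcomes offline.
import random
from collections import Counter

WEIGHTS = {
    "champion_v1": 95,   # incumbent model keeps the bulk of traffic
    "challenger_a": 1,
    "challenger_b": 1,
    "challenger_c": 1,
    "challenger_d": 1,
}


def pick_model() -> str:
    """Choose which model serves this request, proportionally to its weight."""
    names = list(WEIGHTS)
    return random.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]


if __name__ == "__main__":
    counts = Counter(pick_model() for _ in range(100_000))
    print(counts.most_common())
```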

Akshay Manchale 00:24:53 So let’s talk a little bit about failure handling, right? When you look at machine learning applications, they’re built on various layers of foundational services. You have your DataOps, you have your Platform Ops. In what ways can you see failures? Of course, you can see failures in each of those layers, but how do you respond to those failures? How do you keep your model up and running? And is there a way to tell a failure of something downstream from a failure of the model’s prediction itself?

Noah Gift 00:25:22 One thing to consider is that many people don’t treat data science or machine learning like data science. There’s like a meta data science layer, which is kind of surprising, right? If you are deploying something into production and you’re looking at the data, there’s a word for this: it’s called data science, right? Like, if you’re a software engineer and you have log files and you’re using the logs to make statistical decisions about what you’re doing, that is data science, there’s no other way to put it. Monitoring, logging, instrumentation is data science. So I would say that you need to, at a meta layer, also apply data science to what it is you’re doing at each layer. Look at it, have dashboards that can show the differences. So I think that’s just a no-brainer. Again, if you only have experience with Jupyter notebooks, this may be new to you, but people have been looking at logs for decades.

Noah Gift 00:26:16 I mean, in fact, multiple decades; this is a classic problem. Even pre-internet, people were looking at logs and sorting data and things like that. Even on newsgroups or a bulletin board service, a BBS (I was on those when I was in junior high, actually when I was like 10, on text-based terminals), people were looking at log files. So I would say data science is definitely the approach to use for this. And then also, I think there’s the business side, which would be kind of high level, which is: if you deploy a model into production, are you actually looking at what’s happening? And I think a really good example of this actually is social media. And this is something I hope researchers will really dig into more.

Noah Gift 00:27:05 I’ve seen some great stuff about this, but this concept of the recommendation engine is, I think, a perfect example. This was a big deal for a long time. Yes, recommendation engines. We love recommendation engines. And one of the things I think has really been a problem with recommendation engines is we’re starting to realize that there are unintended consequences of a recommendation engine, and many of them are very bad, right? So there is harm to society in getting people harmful information or recommending it to them because it increases engagement. So I think those are things that are really important to look at from a stakeholder perspective. And you can see there are some company structures, like the B Corp structure, where they talk about this. Like, what is your impact on societal cohesion? I think those are some things that should be looked at: how much revenue is your model making?

Noah Gift 00:28:03 Is it actually doing things that are helpful to people? Is it harming humans at scale? Is it really something we even need to do? Like, I mean, I think you could make the argument that for many companies that do recommendations at scale (YouTube, Facebook, Twitter), maybe they should just turn off all recommendations, right? Do we really know the impact of those? So I think that’s another thing to put into the situation: once the model’s been deployed, should you be prepared to just turn it off? On one level, a surface level, it may be performing the way you expect, but what if it’s not doing what you expected at a more holistic level, and what can you do to mitigate that?

Akshay Manchale 00:28:54 I think that’s a really good point about responsible AI or ethical AI that’s being talked about right now. So if you look at MLOps as something similar to software development, you have a life cycle of software development, maybe Waterfall, Agile, whatever you’re doing, and you have a way of doing MLOps. At what point, at what stages, do you consciously think about, say, the ethical considerations of what you’re trying to build in this whole life cycle of building a machine learning application?

Noah Gift 00:29:24 For me personally, one of the things I’m trying to promote is the concept of: are you harming humans at scale, are you neutral, or are you helping humans at scale? And that’s the framework. I think that’s pretty straightforward, right? And if we look at social media companies (and I think there’s a big documentary about this, The Social Dilemma), YouTube at one point served out more traffic to Alex Jones than all of the major newspapers in the world, right? I mean, that to me is very clear. That’s harming humans at scale, and they made a lot of money based on putting ads on that. I hope someday there’s a reckoning for that. And similarly with companies like Facebook, to this day we don’t know all the different things they’re doing. But I think during the January 6th riot, or around then, I don’t remember all the details, they were actually recommending things like body armor and weapons to people.

Noah Gift 00:30:24 And we obviously see from recent events that people do actually act on those things. They buy body armor and weapons and do things. So it’s not like a theoretical connecting of the dots; there’s actual connecting of the dots. I think that would be something I hope talented people new to the industry look at and ask themselves that question: am I neutral? Am I harming humans at scale, or am I helping them? And I think there’s this belief in certain segments of the tech industry that you don’t have to care about that for some reason. I don’t understand why you would think you don’t need to know about this, because it’s the world you live in. And I think it is important for people to say, I want to be careful about what it is I’m working on.

Noah Gift 00:31:14 I mean, here’s a good example. Let’s take a company like Coursera, which I do a lot of work with. They’re a B Corp certified company. Please tell me something they’re doing that’s harming humans, or even something that’s neutral. They’re definitely not neutral, and they’re definitely not harming humans. They’re helping humans at scale, right? That’s a pretty clear example: you’re teaching people new things that help them make more money, and it’s free, right? You can audit Coursera for free. I mean, that’s unambiguously good. And then you can also find examples, like, I don’t know, making dirty bombs that get put into land mines or something like that. That’s unambiguously bad; you’re hurting people. So I think that’s really something I hope more people look at, and not push it into a political Republican-Democrat, whatever viewpoint, because it’s not; it’s a fact: either you’re helping, you’re neutral, or you’re harming. And I think that framework is a good framework to consider.

Akshay Manchale 00:32:15 Yeah. I want to switch gears a little bit into just running machine learning models in production. So what does the runtime look like for machine learning? If you are, say, a small company versus a very large company, what are the options for where you can run machine learning models, and how does that impact your revenue, maybe, or how quickly you can run or iterate, et cetera?

Noah Gift 00:32:38 Yeah. I think this is a good question you bring up, because just like how, if you were going to build maybe a house, it would be a different tool chain than if you were going to build a skyscraper, right? Or a condominium tower, you would potentially have very different machinery. Or if you’re going to build a bike shed in your backyard, maybe you don’t need any tools; you just buy a shed and literally plop it down. I think that’s important for companies to think about: before you start copying the practices of, let’s say, Google or some other large company, really consider, do you need to do the things that the big companies are doing? Or in the case of a smaller company, it might be better for you to use a pre-trained model, right?

Noah Gift 00:33:29 There are tons of pre-trained models, and it would just not be possible for you to get the same level of results yourself. And maybe the pre-trained model is exactly what you need. So why not start there? Or auto ML would be another one. If you’re more of a medium-sized company, then potentially I would recommend heavily looking at using a platform, getting people in your organization certified in the platform, and organizing your workflow around the platform. And then if you’re a very large company, like a top-five company or something like this, that’s when they start to develop their own infrastructure, where the core infrastructure that a medium company would use may not work. And you’ll see a lot of technology platforms get developed by people who are at one of these companies where they have their own data center, so they can’t use AWS, for example. And so then they build their own infrastructure. So you could probably break things into those three different categories.

Akshay Manchale 00:34:29 And if you’re a small company, maybe you just use auto ML, as you said. Can you talk more about auto ML?

Noah Gift 00:34:34 Yeah. So with auto ML, really the idea here is that you’re using high-level tools to train a model, a bespoke model. And there’s a lot of variation in how much auto ML is actually fully doing the job for you, because it can mean lots of different things. But in general, the concept is you take your data, you feed it into a high-level system, you tell it what target you want to predict, and then you run something, you click a button, and it plugs away at the problem and then gives you back a model. So in that sense, auto ML, I think, can be a very good solution for many organizations. And there does appear to be traction with auto ML from every single platform. One of my favorite auto ML solutions is actually from Apple, and it’s called Create ML.
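
Create ML itself is an Apple tool, so as a language-neutral illustration of the auto ML idea, here is a toy Python sketch: hand over a table and a target, let the tool try a few model families, and take back the best one. Real auto ML products do far more (feature engineering, hyperparameter search, ensembling); the dataset and candidate list here are just placeholders.

```python
# Toy "auto ML": try several candidate models and keep the best by
# cross-validated accuracy. Not Create ML, just the general shape of it.
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def auto_train(X, y):
    """Return the best-scoring candidate model, fit on all of the data."""
    candidates = [
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
        RandomForestClassifier(n_estimators=200, random_state=0),
        GradientBoostingClassifier(random_state=0),
    ]
    scored = [(cross_val_score(m, X, y, cv=5).mean(), m) for m in candidates]
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    return best_model.fit(X, y), best_score


if __name__ == "__main__":
    X, y = load_wine(return_X_y=True)  # stand-in for "your data plus a target"
    model, score = auto_train(X, y)
    print(type(model).__name__, f"cross-validated accuracy ~ {score:.3f}")
```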

Akshay Manchale 00:35:28 In your book, you talk about another thing called Kaizen ML, contrasting it with the principles of Kaizen. So what is Kaizen ML? How do you practice it?

Noah Gift 00:35:37 Yeah. So basically my point in bringing up Kaizen ML is that I think it’s easy to get distracted; people even get upset when you talk about auto ML. It’s like, oh, you’re going to automate my job. And people get really worried because they really like what they do on Kaggle, and they enjoy it. But my point is that Kaizen ML would be more about thinking holistically, like, look, we’re going to automate every possible thing that is automatable. It could be hyperparameter tuning. It could be trying different kinds of experiments. But the idea is you’re not really caring necessarily what the approach is. It could be a whole group of different techniques, but you’ll use the thing that helps you automate as much as possible to get to the end solution.

Akshay Manchale 00:36:27 Okay. And just in terms of bootstrapping some sort of a machine learning solution, I think there are two approaches. One is you do it in a data-centric way, or maybe you start with a model in mind and you do it in a model-centric way. Can you talk about what the differences are, starting with one versus the other, and how it might be advantageous for, say, a small shop versus a large shop that might do it completely differently?

Noah Gift 00:36:52 It’s interesting, because the data-centric versus model-centric argument is, I don’t know if I buy that, actually. So I think more in terms of the rule of 25%, where to me it feels like you may be overestimating the organization’s ability to do DevOps, and you also may be overestimating your organization’s ability to do product management. And so I think a better approach, versus model-centric or data-centric, is that all those four quadrants are treated equally. So for example, you have to do maybe a maturity analysis and look at the beginning and say, look, do we even have DevOps? If you don’t, who cares about model-centric or data-centric? You’re going to fail, right? And then look at the data: do we have any kind of data automation? Well, if you don’t, then you’ll fail.

Noah Gift 00:37:42 And then once you have some of those foundational pieces, then the other part is, even if you want to be more data-centric or more model-centric, and there are pros and cons of both, if you’re not identifying the correct business use case, you’ll also fail. So that’s why, I mean, my view is a very different view than an expert like Andrew Ng, who is obviously a very talented person, right, and has all kinds of experience, but more in the academic world, where my experience is more blue collar, in that I’ve spent a lot of my life with greasy hands, right? I’m in the car, I’m building software solutions. I think that delineation between model-centric and data-centric is kind of theoretically interesting for a certain life cycle stage.

Noah Gift 00:38:33 But I would say that’s not the place to start. The place to start would be to holistically look at the problem, which is, again, the rule of 25%. Once you have all those components set up and you really have that feedback loop, then I could see someone making the argument, which I don’t disagree with, about what’s more important, the modeling or the data. Yeah, probably the data, right? Because the modeling, I can just click a button and I can train models. So why do I need to do that? Let’s get even better at massaging the data. But I just feel like it’s kind of misleading to lead with that, when the holistic approach, I think, is where people should probably start.

Akshay Manchale 00:39:12 And let’s say you are taking a holistic approach to starting out. One of the choices that you might have is maybe you should be running this in the Cloud, using maybe an auto ML-like solution, or maybe just because you want to have more compute power. How do you decide whether that’s the right approach compared to trying to do it on-prem, because your data might be in different places? Is that still a concern when you’re trying to look at it holistically, to decide where you want to do your training or deployment? And at what point do you actually have that clarity to say one or the other?

Noah Gift 00:39:47 I think it would potentially be a good idea to use the most popular solutions. So let’s just take, from a data science perspective, who is the top Cloud provider? Well, it’s AWS. Okay, well, what is their product? They recommend SageMaker. Okay, start there, right? That’s one really simple way to work. And then read the documentation, like literally the manual. This is, when I was growing up, the thing people used to say to you before there was Stack Overflow. They would say RTFM, read the manual, with a little bit of cussing in there. And basically, that’s exactly what I recommend: use the largest platform on the largest Cloud, and then just literally read their documentation and do exactly what they say. That’s probably one of the better approaches.

Noah Gift 00:40:36 I think I would be a little worried about on-prem and dealing with that. I would probably recommend to somebody, why don’t you pick the smallest possible thing you can do that’s not on-prem initially? Unless you really have deep expertise in on-prem and you have experts doing world-class data engineering, then maybe, yeah, it doesn’t matter; you can do anything and you’ll be successful. But if you’re kind of new and things are a little bit clunky, maybe just take a very, very, very tiny problem, like the smallest possible problem, even a problem that’s so tiny that it’s inconsequential whether it succeeds or fails, and then get a pipeline working end to end, again using the most popular tools. And the reason I also mention the most popular tools is that it’s easy to hire people. So you just go with whatever is the most popular. Maybe in 10 years AWS won’t be the most popular; I would again say pick whatever the most popular tool is, because the documentation will be there and it’s easy to hire people.

Akshay Manchale 00:41:35 What do you have to say about interoperability concerns? You talk a little bit in the book about how critical that is. So maybe can you explain why it’s critical? And let’s say you actually pick the most popular tool chain available. What do you have to do to make sure it is interoperable in the future?

Noah Gift 00:41:54 I think sometimes you don’t care. It’s a good problem to have, that you’re successful and you’re locked into the Cloud. I mean, I’m not a believer in lock-in fears. I know many people are afraid of lock-in, but I think a bigger problem is, does anything work? That’s probably the number one problem: does anything work? And I would say maybe you don’t need it. Like, you don’t need to care about it in the short term; first, try to make sure you get something that works. There’s an expression I use, YAGNI: “you aren’t gonna need it.” I think a lot of times, just get something working and see what happens. And if you need to change, maybe the future has changed at that point, and you just do the new thing.

Akshay Manchale 00:42:34 Yeah, that makes sense. And adding onto that, I think there are some recommendations saying, go with the microservices-based approach. And if you ask a traditional software engineer, maybe there is some more skepticism about going with microservices, just because of the complexity. But I think you make an argument in several places in the book about how it might simplify things for machine learning. So can you talk a little bit about why you think it might simplify things, especially in machine learning applications versus traditional software?

Noah Gift 00:43:03 Yeah. I think that the traditional object-oriented, monolithic kind of workflow is really good for things like, let’s say, a mobile app, right? That could be a great example, or a content management or payroll system, or something like that, where there are a lot of reasons why a monolithic application and heavy, heavy object-oriented programming would work very well. But I think in terms of the DevOps style, one of the recommendations is microservices, because you can build things very quickly and test out those ideas. And also, microservices in some sense kind of implicitly use containers. It’s very difficult to pull the idea of a container out of a microservice. And then the nice thing about a container is that it has the runtime along with the software. So I think the benefits are so great that it’s hard to ignore microservices. I mean, the ability to package the runtime alongside the software and make a very small change, test it out, and deploy, it really works well for machine learning.

Akshay Manchale 00:44:12 When it comes to using data for your machine learning, data really is at the center of your application in many ways, and you have to be careful about how you use it, because there are so many regulatory restrictions around how you use it, or there’s governance around what you can use, what you cannot use, the right to be forgotten, et cetera. So how do you go about approaching those limitations, or rather regulations, that you really have to follow legally?

Noah Gift 00:44:40 Yeah. I mean, that really depends on the size of the organization, the problem they’re solving, and also the jurisdiction that they’re in. I don’t think there’s a one-size-fits-all solution there. You could make the argument that many companies collect too much data, so one way to solve the problem is just don’t collect it, right? Like, there may be no good reason to collect it. For example, if you’re running a dating app, maybe you don’t need to store the data of the location of the users. Like, why would you need that? It could only cause problems for people in the future. Like, again, harming humans at scale. So just don’t do it. Another thing is maybe you don’t enter certain spaces that are heavily regulated; you just don’t, I don’t know, get into a place where you have to deal with that kind of regulation.

Noah Gift 00:45:31 Another one is also the type of data. So you could just never store, as a practice, any personally identifiable information, PII. So I think there are mitigation strategies, and part of it could just be being a lot more careful about what it is you collect and/or what markets you choose to get into. There’s also this concept of being a unicorn or being like a trillion-dollar company; I think hopefully those days are over where everybody wants to be a billion-dollar company. Maybe it’s okay to be a $10 million company. And so maybe instead you focus on fewer things, and the things you do really well, and you don’t care about becoming some huge company. And so maybe that’s another solution as well.

Akshay Manchale 00:46:18 Well, I guess more data, more problems, but can you talk about security? Are there specific things that you would do to make sure that your model is secure, or something different that you have to do in machine learning that you wouldn’t otherwise do in traditional software, or that you don’t have to do in machine learning?

Noah Gift 00:46:37 Yeah. I think a couple of things come to mind. One is that if you’re training your model on data that the public gives you, that could be dangerous. And in fact, I was at Tesla headquarters, I think it was October, so maybe six to nine months ago, for their AI day. And that was actually a question that was asked (maybe I asked it, I don’t remember, but it was me or somebody): hey, are you sure people aren’t embedding stuff inside of your computer vision model that causes problems? And the answer they gave is, we don’t know. And in fact, they knew that if you walked in front of a Tesla and you had the word stop on your shirt or something like that, you could cause it to stop suddenly.

Noah Gift 00:47:31 So I think that is an area of concern. Maybe going back again to data collection: be very careful training the model on data that was publicly put into the system, because if you don’t have control over it, somebody could be planting a back door into your system and basically creating a zero-day exploit for your system. So one solution, especially if you’re a smaller company, could be to just use pre-trained models, and actually focus on pre-trained models that have a very good history of data governance and best practices. And you kind of draft off of their wave so you can leverage their capability. So those are just a couple of ideas that I had.
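
As one concrete shape of the "just use a pre-trained model" option, here is a minimal Python sketch using the Hugging Face transformers pipeline; the choice of Hugging Face and of a sentiment model is an editorial example, not something Noah names, and a team would still want to review the provenance of whatever published model it picks.

```python
# Lean on a published, pre-trained model instead of training on data the
# public can poison; nothing user-submitted becomes training data here.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a published model on first use

if __name__ == "__main__":
    print(classifier("This release candidate looks solid."))
```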

Akshay Manchale 00:48:16 Okay. And you said you’ve been doing this since like 2013, so I kind of want to like start wrapping up. What are the big changes you’ve seen since then? And what are the changes that you see going into the future in the next, like say five, six years?

Noah Gift 00:48:28 Yeah. I would say the big change that I saw in 2013 was that at the time when I was creating models, I was actually using R, even though I’ve done a lot of stuff with Python and I’ve done stuff with C# and other languages. But I was using R because it had some really good statistical libraries, and I liked the way the machine learning libraries worked. The libraries have just massively changed; that’s one huge change. The data collection systems too: I was using Jenkins to collect data. I mean, there are things like Airflow now, and all these really cool, sophisticated systems; Databricks has gotten a lot better. There are all these sophisticated systems now that do data engineering. So I would say libraries and data, and also platforms. And then there’s the stuff that I see happening in the future.

Noah Gift 00:49:16 So I would say the platforms are definitely becoming mature now. They just didn’t exist before, and the libraries are becoming much better. And I think also serving is now becoming important. I would say 2023 is probably where we’re going to see a huge emphasis on model serving. We’re getting a little bit of it now, but that’s actually my focus: model serving. And the reason why model serving, I think, is so interesting is that we don’t yet necessarily have web frameworks that are designed for serving machine learning models. We have people essentially adopting and hacking together web frameworks like FastAPI or Flask that will kind of take a model and put it together. You see a little bit of this, like TensorFlow Serving as an example. I know MLRun has some of this as well. But I think we’re going to see some really strong software engineering best practices around model serving that make it way simpler, and some of the things that you care about, like model accuracy and lineage and all this stuff, will kind of be baked into the model serving. And then I would also say auto ML. I think auto ML will be ubiquitous.
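
For readers who haven’t seen the "hacked together" pattern he refers to, here is a bare-bones FastAPI sketch in Python. The toy model trained at startup, the endpoint name, and the file name in the run comment are all illustrative; a real service would load a pinned artifact and layer in the monitoring, accuracy, and lineage concerns he mentions.

```python
# Minimal model-serving microservice: wrap a model behind a /predict route.
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

app = FastAPI()

# Stand-in model trained at startup; a real service would load a versioned artifact.
_X, _y = load_iris(return_X_y=True)
_model = LogisticRegression(max_iter=1000).fit(_X, _y)


class Features(BaseModel):
    values: list[float]  # the four iris measurements, for this toy example


@app.post("/predict")
def predict(features: Features) -> dict:
    """Return the predicted class for one row of features."""
    label = int(_model.predict([features.values])[0])
    return {"prediction": label}

# Run with, for example: uvicorn serve:app --reload   (assuming this file is serve.py)
```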

Akshay Manchale 00:50:31 Yeah, that would be great: just having that access to machine learning that you could use at the click of a button and see if it does something. One last thing, finally: how can our listeners reach you? I know you have a lot of writings and videos and educational content that you put out there. So how can people reach you or get to know your content?

Noah Gift 00:50:51 Yeah. So if you just go to Noahgift.com, you can see most of the content I’ve published: books, courses. LinkedIn is the only social network I use; I don’t use Twitter or Facebook or Instagram. And also, if you go to Coursera or O’Reilly, there’s a lot of content that I have on both of those platforms.

Akshay Manchale 00:51:10 Excellent. Noah, thank you so much for coming on the show and talking about MLOps. This is Akshay Manchale for Software Engineering Radio. Thank you for listening.

[End of Audio]
