Q&A: Expert tips for running machine learning in production
In this interview, 'Designing Machine Learning Systems' author Chip Huyen shares advice and best practices for building and maintaining ML systems in real-world contexts.
Building a machine learning model in an academic research context is already challenging. But the messiness and fluidity of real-world data and business objectives add a layer of complexity that's not always addressed in data science and ML programs.
In Designing Machine Learning Systems, published by O'Reilly Media, author Chip Huyen presents an accessible yet comprehensive overview of what goes into designing flexible, reliable ML systems. Huyen, a Stanford-trained computer scientist and founder of ML platform startup Claypot AI, shares advice for building, deploying and maintaining ML systems that can adapt to the changing nature of production environments.
In this Q&A with TechTarget Editorial, Huyen discusses her writing process, common challenges when running ML in production and how the generative AI boom has affected her thinking about the book. For a taste of Designing Machine Learning Systems, read an excerpt from Chapter 1, "Overview of Machine Learning Systems," where Huyen explains the nuances that differentiate ML from traditional software.
Editor's note: This Q&A has been edited for clarity and conciseness.
What motivated you to write Designing Machine Learning Systems?
Chip Huyen: I don't think it was a conscious decision. I didn't set out to write a book -- it was more of an evolution.
Maybe because I come from a writing background, I really love taking notes. And so during my first job after college, at Nvidia, I started taking a ton of notes. I was talking to a lot of people -- our customers -- about all the things that I saw and the challenges they faced. I actually first published an open source note on GitHub, like 8,000 words on the very basics of the frameworks I saw for deploying ML models, and it somehow got pretty popular.
Then, I went back to Stanford and started teaching a course on [ML systems design]. I wanted to make sure that my students understood what I said, so I made a lot of lecture notes for them as well. Through the two iterations of the course, I got a lot of feedback -- from students, from my professors, from other industry people that I know. And eventually I said, 'Oh, wow, I actually now have a pretty comprehensive set of notes on this topic. Why not turn it into a book?' So it was more of a natural four-year process until I got to the book.
One key theme of your book is the importance of focusing on ML systems as a whole, not just models -- you write, 'The algorithm is only a small part of the actual system in production.' Could you expand on why that holistic view is so important?
Huyen: It's different for people who want to use a model in production versus people who want to make core algorithmic or research progress. A lot of the courses I took in school were really great, but they have a very sharp focus on the algorithm part -- the modeling part. It was very useful to learn about that. But then I realized, when we started helping companies deploy these models, that this is not enough.
One thing, for example, is organizational structure. To deploy a model, you create it and hand it to another team. And that team has no idea what to do, right? They treat it the same way they treat traditional software, whereas there are a lot of differences between ML models and traditional software. And because of that, I think we need to put more guardrails in the process to make sure that ML models can perform well in production.
What are some of the challenges that come up when operationalizing and maintaining ML systems in production over the long term versus, as you said, working in an academic context?
Huyen: One thing is the question of model performance. For example, in school, when you're doing research, what you care about is this kind of leaderboard-style competition. But there's a very big gap between a model that can do well in leaderboard-style competitions and a model that does well in production.
For one thing, when you do a standardized task, a lot of things are well understood. Data is pretty much cleaned. You know exactly what it's like. It doesn't change over time. You can probably even find some script somebody wrote for you to deal with that. So the data part is pretty much done for you, whereas the data is a huge challenge in production.
Another issue is latency. When you just care about model performance, accuracy or F1 score, you don't care about how long it's going to take to generate predictions, whereas in production, you don't want users to wait.
Also, the metrics. Usually, you have a concrete ground-truth label to compare the model predictions with in those leaderboard-style competitions or in research. But in production, the vast majority of the time, we don't have all of these ground-truth labels. So how do you monitor the model in production?
And, of course, the world changes. So many things are happening nowadays, things change very fast, and any AI needs to be able to change with the times. How do we keep updating the model effectively? These are all questions that you don't get to answer in just a research environment or in class.
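Editor's note: One common workaround for the monitoring problem Huyen raises -- no ground-truth labels in production -- is to watch the distribution of the model's own predictions for drift. The Python sketch below is a minimal, hypothetical illustration; the threshold, window sizes and simulated scores are invented for the example and are not from Huyen's book.

```python
# Hypothetical sketch: detect prediction drift without ground-truth labels by
# comparing a reference window of model scores (captured at deployment) against
# a live production window. Threshold and window sizes are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def prediction_drift_alert(reference_scores, live_scores, p_threshold=0.01):
    """Flag drift when the live scores look statistically different from the
    reference window, using a two-sample Kolmogorov-Smirnov test."""
    statistic, p_value = ks_2samp(reference_scores, live_scores)
    return p_value < p_threshold, statistic, p_value

# Simulated data: production inputs have shifted, so the score distribution moves.
rng = np.random.default_rng(0)
reference = rng.beta(2, 5, size=5_000)  # scores observed at deployment time
live = rng.beta(2, 3, size=5_000)       # scores observed in production today
drifted, stat, p = prediction_drift_alert(reference, live)
print(f"drift={drifted}, KS statistic={stat:.3f}, p-value={p:.3g}")
```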
Are there common mistakes or misconceptions that you see come up a lot when designing ML systems for production?
Huyen: One key thing is to figure out a way to consistently evaluate your model and let people compare a new iteration to the last iteration. And it's very hard. In a lot of businesses, an ML initiative needs to be tied to a business objective. For a lot of companies, the objective is revenue, profit -- business metrics. But if you use only business metrics, it becomes impossible to attribute any change to the model, because a company may have a thousand initiatives running at the same time.
Say I want to try to boost revenue. If the revenue goes up, you don't know if it's the [current] model or some other model. Or if the model's F1 score or accuracy goes up, you're not sure if that will lead to the business revenue going up.
We have teams saying, 'We can predict with 96% accuracy now, versus 95.8% accuracy before. We should totally deploy [the new model].' And then when you talk to the business people, they're like, 'Actually, most users do not notice a difference between 96% and 95.8%.' So what is the point of doing it, you know? It's very hard, finding the right tradeoff.
Or say you run an e-commerce website, and you want to increase the purchase rate. You have this hypothesis that if people see the thing that they like, they're going to buy it more often, so you want the recommender system to recommend items that users might like. The assumption here is that the more users click on [a product], the more people will buy it. And so you use this metric of click-through rate.
Now, imagine that the model is doing very well through that lens of click-through rate. But what if the business metric still doesn't go up? There's a chance that the ML model metrics and the business metrics don't line up. Then what do you do about that?
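Editor's note: As a hypothetical illustration of the proxy-metric gap Huyen describes, the short Python sketch below computes click-through rate and purchase rate from an invented event log; a new model can double clicks while leaving purchases flat. The log structure and field names are made up for the example.

```python
# Hypothetical illustration: the recommender metric (CTR) improves while the
# business metric (purchase rate) does not. Event log and fields are invented.
def funnel_metrics(events):
    """events: one dict per recommendation impression, with boolean
    'clicked' and 'purchased' flags."""
    impressions = len(events)
    clicks = sum(e["clicked"] for e in events)
    purchases = sum(e["purchased"] for e in events)
    return {
        "ctr": clicks / impressions,               # what the ML team optimizes
        "purchase_rate": purchases / impressions,  # what the business tracks
    }

old_model = [{"clicked": i % 10 == 0, "purchased": i % 50 == 0} for i in range(1000)]
new_model = [{"clicked": i % 5 == 0, "purchased": i % 50 == 0} for i in range(1000)]
print(funnel_metrics(old_model))  # {'ctr': 0.1, 'purchase_rate': 0.02}
print(funnel_metrics(new_model))  # {'ctr': 0.2, 'purchase_rate': 0.02}
```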
Another is setting up data infrastructure. I think data infrastructure is glossed over in a lot of classes. Learning how to operate, build or maintain data systems takes a lot of time. Maybe it doesn't fit into 10 weeks or 16 weeks of coursework. You also want students to enjoy it, and a lot of the time, data infrastructure is just not enjoyable. It's not fun, right? It's painful.
So I see it being glossed over a lot, but it's a big thing. AI and ML today depend on data. It doesn't matter how fancy somebody's prototype running in a Jupyter notebook is going to be -- if it cannot access the data fast enough or in the right way, it's not going to be useful in production. AI strategy has to start from data strategy.
It's funny, people keep asking me, 'So the book was published before ChatGPT came out. Did generative AI change a lot? Change everything?' And, actually, the fundamentals stay the same with generative AI. Even though ChatGPT is new, it's built on existing technology. It's not something that just came out of the blue, you know? Language models have been around since the 1950s. Embeddings have been around since the early 2000s. Vector databases have been around for a few years. So a lot of this has remained pretty much the same.
Yeah, it seems like a lot of what's changed has been the scale or the level of attention, rather than the underlying technologies. How have you seen the generative AI boom affecting the industry and ML roles?
Huyen: Actually, I feel like generative AI made a lot of the focus points of the book more relevant. Generative AI turned models into an API. You don't really build models from scratch anymore. It's good to know algorithms, for sure -- I do encourage people interested in doing research to definitely learn those skills -- but a lot of people just use these models as an API. They don't care about how it's run under the hood.
So I do think generative AI actually highlighted the production side, because the rest of the book is about the rest of the system: data infrastructure, evaluating AI for business use cases, focusing on user experience, organizational structure, building platforms so that people can automate a lot of their work and evaluations. A lot of this is actually more important now that the model part is being commoditized away [through APIs].
Another thing is, for a long time, we said that ML should be more interactive. In the past, a lot of people followed this kind of batch prediction [approach]. Say, for Uber Eats or DoorDash, when you log in to the app, you see a lot of recommendations for what restaurants you might enjoy. In the past, a lot of this had been pre-computed for you every day. So if this company has 100 million users, they're generating recommendations for 100 million users.
But then these predictions can get stale. Maybe yesterday I enjoyed Italian, but now today I want Vietnamese food. And it costs money. Of 100 million users, not all of them are going to log in every day. If only 2% of them are going to log in each day, and you generate [predictions] for 100 million users, 98 million predictions are going to be wasted.
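Editor's note: Huyen's back-of-the-envelope math is easy to reproduce. The short Python sketch below uses her illustrative numbers from this interview -- 100 million users, 2% daily active -- not real platform data.

```python
# Daily batch precomputation vs. on-demand prediction, using the interview's
# illustrative numbers.
total_users = 100_000_000
daily_active_rate = 0.02

batch_predictions = total_users                # precompute for everyone
served = int(total_users * daily_active_rate)  # 2,000,000 actually seen
wasted = batch_predictions - served            # 98,000,000 never used

print(f"served: {served:,}, wasted: {wasted:,}")
# On-demand (online) prediction computes only the 2,000,000 requests that
# actually arrive, at the cost of meeting a per-request latency budget.
```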
And then, of course, generative AI came out. With ChatGPT, everything is basically real-time prediction, right? You send in a request, and you get back a prediction that gives the impression of being generated on the fly. And now people just take it for granted. We don't need to convince people of that anymore.
Or another aspect is model drift. Things change over time. So sometimes you ask ChatGPT, 'Hey, tell me about this or that,' and it's like, 'Oh, as of my knowledge cutoff of September 2021, I cannot do that.' How do we keep models up to date with the changing world? That is why it's important to monitor and continually update the model over time.
So I think generative AI actually highlighted a lot of challenges with ML in production.
It sounds like some of the things that you're drawing out there are broader trends that we're seeing across software and IT -- the overall trend toward abstracting away some of that underlying infrastructure, the shift toward platforms. It's interesting to see, in some ways, ML and MLOps becoming much more similar to traditional software and DevOps.
Huyen: Yeah, the new thing that generative AI brought is that it's so easy to use. A lot of people traditionally have not been able to build applications because they lack engineering knowledge or ML knowledge. Now they can. We see a lot of tools coming out to help these people build very low-code [applications]. I think it's exciting. I feel like now, people with less AI knowledge but more domain expertise can bring AI to the vast majority of industries and many, many use cases. It's very exciting.
At the same time, we can't expect people to know everything, right? People who are really good at photo editing or movie editing might not be the best at engineering practices, like putting in guardrails and making sure the model performs, stays up to date, and is reliable and robust. So I think we'll see a lot of people building tools that make it very hard for people without engineering knowledge to make mistakes in creating these applications.