Adaptive Inference: Why Taking a Ferrari for Groceries Is a Terrible Idea

Track:
Machine Learning: Research & Applications
Type:
Talk
Level:
Intermediate
Duration:
30 minutes

Abstract

Deploying machine learning models for real-world tasks is expensive, especially for inference. Unlike training, inference isn’t a one-and-done deal; it’s a recurring cost that grows with every prediction you make. And naturally, if you want accurate results, you’re probably calling up the biggest, most powerful model in your arsenal. But here’s the problem: these large models are resource-hungry, and most inputs don’t even need their full power. In fact, research shows that small, efficient models can handle a large share of everyday inputs just as well.

So why bring a Ferrari to pick up groceries? Adaptive inference offers a smarter solution: instead of using one oversized model for everything, it dynamically selects which model to use based on task difficulty. For simple inputs, you call smaller, cheaper models. For harder tasks, you escalate to the big guns. The result? High accuracy without blowing your compute budget.
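To make the routing idea concrete, here is a minimal sketch of one common strategy, confidence-threshold routing. Everything in it is illustrative rather than taken from the talk: the duck-typed predict_proba/predict interfaces, the model names, and the 0.9 threshold are all assumptions.

    # Illustrative sketch of confidence-threshold routing, one simple form of
    # adaptive inference. Interfaces and the threshold value are assumptions.
    import numpy as np

    def route(x, small_model, large_model, threshold=0.9):
        """Answer with the small model when it is confident; escalate otherwise."""
        probs = small_model.predict_proba(x)   # class probabilities for input x
        if np.max(probs) >= threshold:         # small model is confident enough
            return int(np.argmax(probs))       # cheap path: use its answer
        return large_model.predict(x)          # hard input: pay for the big model

The threshold is the cost/accuracy knob: raise it and more traffic escalates to the large model; lower it and you save compute at some risk on borderline inputs.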

This talk will cover:

  • Why using one large model for all tasks is overkill (and expensive).
  • How adaptive inference works and practical strategies for task routing.
  • Challenges in estimating task difficulty and balancing latency with accuracy.
  • Real-world examples of cost savings, from edge-to-cloud setups to large language model APIs.

To make this tangible, I’ll share how Agreement-Based Cascading (ABC) uses ensemble agreement for routing decisions. By letting models decide when they’re needed, ABC cuts costs in edge-to-cloud deployments, reduces GPU rental and API bills, and outperforms state-of-the-art methods, all while staying intuitive and efficient.
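For a flavor of the mechanism, here is a sketch of the core loop (my own illustration, not the ABC implementation): run a small ensemble first, and escalate only when its members disagree. The predict interface and the unanimous-agreement rule below are assumptions.

    # Illustrative sketch of agreement-based cascading (not the speaker's code).
    # Assumes models exposing .predict(x) -> label; the agreement rule is a guess.
    from collections import Counter

    def abc_predict(x, small_ensemble, large_model, min_agreement=1.0):
        """Return the small ensemble's answer when it agrees; escalate otherwise."""
        votes = [m.predict(x) for m in small_ensemble]
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            return label                     # cheap path: the ensemble agrees
        return large_model.predict(x)        # disagreement signals a hard input

The intuitive appeal is that agreement among cheap models doubles as a difficulty estimate you get almost for free, with no separately trained router.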

Whether you’re an ML engineer deploying models, a researcher curious about efficient inference, or just someone who likes saving money without giving up performance, this talk has something for you.