May 13, 2021

How to Move Past the AI Metrics Trap

Cliff Massey

In this blog series, we’ve talked about how to get your AI project out of R&D by aligning your data to real-world specs and involving your SMEs early and often.

The last step to leaving R&D behind is building a model that’s “good enough.”

What makes a model “good enough” is not clearly defined, which makes it extremely difficult to know when your model is “done.” As we’ve talked about before, machine learning is an iterative process—it’s never really “done.” But, as the adage goes, you shouldn’t let perfect be the enemy of good.

First and foremost: statistically speaking, most people aren’t data scientists or machine learning engineers, and therefore most people don’t really understand the common metrics used for measuring model performance—your F1, precision, recall, mAP, etc. They may have a sense of how these stats work (.99 is better than .50 of course), but it’s very difficult to connect those data points back to their individual, day-to-day work.
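For readers who want the definitions behind those acronyms, here is a minimal sketch of how precision, recall, and F1 fall out of raw counts. The function name `precision_recall_f1` and the counts are illustrative assumptions, not taken from any real project.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from raw detection counts."""
    # Precision: of everything the model flagged, how much was right?
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0
    # Recall: of everything that was really there, how much did the model find?
    actual = true_pos + false_neg
    recall = true_pos / actual if actual else 0.0
    # F1: the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 75 correct detections, 25 false alarms, 25 misses.
p, r, f1 = precision_recall_f1(true_pos=75, false_pos=25, false_neg=25)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Note that none of these three numbers says anything about how much time a user saves, which is the gap the rest of this post is about.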

So if you’re looking to get your AI project out of R&D, leading with these performance metrics will make that goal very difficult to achieve. Many people read these scores the way they would read a grade, so they think they can only accept a model with a .98 F1 score. After all, 98% is an A+, right?

Well, maybe not.

In order to be truly successful, at bare minimum, a model needs to reliably speed up or qualitatively improve the workflow of its intended users.

All of the other downstream functions are certainly nice to have (automated flagging, error reporting, record creation, etc.) but don’t mean much if the model doesn’t tangibly improve the work humans are currently using their eyeballs to do—or work that nobody’s doing at all!

Say you need to identify airplanes in daily takes of satellite imagery. Your AI model may only reach a .75 F1 score. In some cases, it labels only part of the airplane, and it seems to do this most often when the plane is parked at an airport gate. Even so, the intel analyst who’d normally have to flag 100% of those planes by hand now only needs to review the examples where the model wasn’t sure that it had found an airplane, perhaps ~25% of them.

That time savings is still very valuable—ask the analyst and she’ll probably agree! But if you’d just looked at F1 score alone, you might have discounted this model as a failure. Of course, you should continue to improve the model with user feedback, but that doesn’t mean the model has to stay in R&D.
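The analyst's reduced workload can be estimated with a few lines of code. The sketch below assumes the model emits a confidence score per detection and that anything under a chosen threshold is routed to a human. The function `review_fraction`, the threshold, and the scores are all hypothetical.

```python
def review_fraction(confidences, threshold):
    """Share of detections below the confidence threshold (sent to a human)."""
    if not confidences:
        return 0.0
    needs_review = sum(1 for c in confidences if c < threshold)
    return needs_review / len(confidences)


# Hypothetical confidence scores for eight detections in one day's imagery.
scores = [0.97, 0.97, 0.91, 0.88, 0.85, 0.80, 0.55, 0.42]
share = review_fraction(scores, threshold=0.75)
print(f"Analyst reviews {share:.0%} of detections")
```

With these made-up scores the analyst reviews 25% of detections. This only counts what the model flagged with low confidence. It does not account for airplanes the model missed entirely, so spot checks of the "confident" output are still worth building into the workflow.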

Ultimately, your SMEs may decide your AI’s fate

Again, from our perspective, the real value AI can add is to improve operational outcomes—to improve efficiency and make life and work easier for people. Your subject matter experts (SMEs) are precisely the people whose lives and work you can improve!

No matter how good your model’s F1 score is in the lab, it won’t matter much at the end of the day if none of your SMEs are ready to adopt that model and find it useful. The best way to earn that adoption is to integrate the model into the tools and workflows people already use on a daily basis. Don’t make AI something people have to “go to” in order to use: bake it right into the tools they’re using already.

This is why we recommend measuring user satisfaction with a model in addition to your usual data science metrics. Do the intended users feel that the model is performing well? Has it sped up their workflow? If so, by how much? Is the work cognitively easier to manage? If so, how or why?
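The "by how much" question can be answered with very simple arithmetic. As a rough sketch, assuming you have timed the same task with and without the model, you could compare median task times. The function `time_saved_fraction` and the timings are hypothetical.

```python
from statistics import median


def time_saved_fraction(before_minutes, after_minutes):
    """Fraction of the median task time saved once the model is in the loop."""
    baseline = median(before_minutes)
    assisted = median(after_minutes)
    return (baseline - assisted) / baseline


# Hypothetical timings (minutes per image batch) from the same analysts.
before = [40, 50, 60]  # fully manual review
after = [10, 15, 20]   # reviewing model output instead
saved = time_saved_fraction(before, after)
print(f"Median time saved: {saved:.0%}")
```

The median is used here so that one unusually slow session does not dominate the result. Pair a number like this with direct feedback from users, since timing alone will not tell you whether the work feels easier.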

In some cases, a user may be more tolerant of some false positives if it means having few to no false negatives, or vice versa. For instance, when manufacturing syringes, you really don’t want to risk letting even a single defective sample get past your quality control. So you might find a model useful that over-detects defects a little (i.e., lower precision), if it means not missing anything (i.e., higher recall).

Metrics like F1 alone can't account for this user-centric nuance.
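To see why, consider two hypothetical models with mirrored precision and recall. The sketch below uses the standard F-beta formula, where a beta above 1 weights recall more heavily. The model names and numbers are illustrative assumptions only.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical syringe-inspection models (numbers are illustrative only).
over_detector = {"precision": 0.60, "recall": 0.90}   # extra false alarms
under_detector = {"precision": 0.90, "recall": 0.60}  # misses more defects

for name, m in [("over-detector", over_detector),
                ("under-detector", under_detector)]:
    f1 = f_beta(m["precision"], m["recall"])
    f2 = f_beta(m["precision"], m["recall"], beta=2.0)
    print(f"{name}: F1={f1:.2f}  F2={f2:.2f}")
```

Both models score an F1 of 0.72, yet a quality-control team would much prefer the over-detector. A recall-weighted score such as F2 (0.82 versus 0.64 here) separates them, but choosing that weighting still requires asking the users which kind of error costs them more.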

Often easier said than done

We get it: this part isn’t so easy. Company leadership or the Board is talking about an AI roadmap for the company, and innovation teams are on the front line of making those projects a reality. The higher-ups want to see hard metrics for success, and being able to show a model with .99 F1 running with four-nines uptime definitely looks good on paper!

While it’s not necessarily easy, we still believe it’s better to start educating your colleagues now about user impact and satisfaction scores as tangible, important metrics for measuring success of AI projects.

We believe those scores will correlate more strongly with successful AI deployment than something like F1 score alone!
