Previously, we discussed the two important reasons why AI projects get stuck in R&D. One of the main reasons is that projects are scoped only to available data. This is also known as the “I have some data, let me see what I can do with it” problem. It’s perfectly possible to train a model that performs very well on a limited dataset but then doesn’t translate well to a real-world environment.
In brief: just having access to some test data isn’t sufficient to produce a working AI model. If you’re going to get your AI out of R&D, you’ll need to prove to yourself and others that you can create a model that solves a real operational problem! For that, we recommend starting from the end and working backwards: think about your ideal end-goal first, then design your AI strategy to meet it.
Here are some things to consider as you work to create a model that can operate on real-world data.
1. Do your test media (images, video) match your production media as closely as possible?
Machine learning is just like a science experiment: you want to control as many variables as possible while collecting your data.
Controls such as sensor type, lighting, camera angle, zoom level, and how the media are processed should be set prior to image collection. This way, all your media are collected in a consistent way and to the parameters you want. Try to match these controls to your production environment. For example, if your camera is outdoors, lighting conditions will vary throughout the day. Collect your test media across those same conditions so your test setup accounts for this variation as much as possible.
When possible, use real production media for your test. This way, the model you create will be much more likely to show results in a real production environment! Open source datasets are great academic resources, but we find they lack the nuance needed for real production AI.
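One lightweight way to keep yourself honest about the controls listed above is to write down the capture settings for both environments and compare them. The snippet below is only a minimal sketch: the setting names and values are hypothetical placeholders we made up for illustration, not a standard schema, so swap in whatever controls matter for your setup.

```python
# Hypothetical capture settings; keys mirror the controls discussed above.
production_settings = {
    "sensor": "rgb",
    "lighting": "outdoor, varies through the day",
    "camera_angle_deg": 45,
    "zoom": 1.0,
    "processing": "jpeg",
}
test_settings = {
    "sensor": "rgb",
    "lighting": "indoor, fixed",
    "camera_angle_deg": 45,
    "zoom": 2.0,
    "processing": "jpeg",
}


def capture_mismatches(test, production):
    """Return {control: (test_value, production_value)} for each control that differs."""
    controls = sorted(set(test) | set(production))
    return {
        name: (test.get(name), production.get(name))
        for name in controls
        if test.get(name) != production.get(name)
    }


mismatches = capture_mismatches(test_settings, production_settings)
for name, (test_value, prod_value) in mismatches.items():
    print(f"{name}: test={test_value!r} vs production={prod_value!r}")
```

Any control that shows up in the output is a variable your test media are not controlling for yet.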
2. Do you have enough examples of the good, the bad, and the ugly?
This is absolutely key, as you need to expose the model to lots of examples of what you are looking for (i.e. positive samples) and—equally important—what you are not looking for (i.e. negative samples), so the model can learn to tell the difference.
Say you’re building a model to look for defects in products coming off your production line. (We love this use case!) “Defect” is a pretty nebulous term: what specifically are you looking for? Cracks? Dents? Something unique to your situation? Think of training the model just like training a new intern to do the same thing: you’d need to show them plenty of examples of each defect, as well as some examples of defect-free products so they know the difference.
Try to get at least 200 examples of each kind of positive sample (e.g. each type of defect) from your real production environment. The harder the vision problem (consider factors like low-quality imagery or similar-looking positive samples), the more data you’re going to need. More is (almost) always better, but 200 is a great rule-of-thumb to at least get you started. If you don’t have enough examples, talk to the people in your organization who can help you collect more from the real production environment!
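If your media are already labeled, a quick count will tell you which defect types fall short of the 200-example rule-of-thumb. Here is a minimal sketch, assuming your labels are available as a plain list of strings; the class names and counts are made up for illustration.

```python
from collections import Counter

MIN_EXAMPLES = 200  # rule-of-thumb from the text, not a hard requirement


def classes_needing_data(labels, positive_classes, min_examples=MIN_EXAMPLES):
    """Return {class_name: examples_still_needed} for each under-represented positive class."""
    counts = Counter(labels)  # a missing class counts as 0
    return {
        name: min_examples - counts[name]
        for name in positive_classes
        if counts[name] < min_examples
    }


# Hypothetical label list: one entry per labeled example.
labels = ["crack"] * 250 + ["dent"] * 120 + ["no_defect"] * 300

shortfall = classes_needing_data(labels, positive_classes=["crack", "dent", "scratch"])
print(shortfall)  # {'dent': 80, 'scratch': 200}
```

Listing the positive classes explicitly means a defect type with zero examples so far (like "scratch" here) still shows up in the report instead of being silently skipped.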
3. Remember: just because you are continuously training doesn’t mean you are stuck in R&D!
Machine learning is a continuous process: train on your data, deploy a model, collect more data, repeat.
It’s sort of like having a high-performance car: you want to keep the engine tuned and maintain the machine so it’s always performing at its best. Re-training and fine-tuning your model with new data should continue after it leaves R&D.
So don’t lose hope! If you are exposing your model to real production data and it’s working—even if it’s not perfect yet—your project has already left R&D in our opinion!