Sounds impressive. But an accuracy assessment from a lab goes only so far. It says nothing of how the AI will perform in the chaos of a real-world environment, and this is what the Google Health team wanted to find out. Over several months they observed nurses conducting eye scans and interviewed them about their experiences using the new system. The feedback wasn’t entirely positive.

When it worked well, the AI did speed things up. But it sometimes failed to give a result at all. Like most image recognition systems, the deep-learning model had been trained on high-quality scans; to ensure accuracy, it was designed to reject images that fell below a certain threshold of quality. With nurses scanning dozens of patients an hour and often taking the photos in poor lighting conditions, more than a fifth of the images were rejected.

Patients whose images were kicked out of the system were told they would have to visit a specialist at another clinic on another day. If they found it hard to take time off work or did not have a car, this was obviously inconvenient. Nurses felt frustrated, especially when they believed the rejected scans showed no signs of disease and the follow-up appointments were unnecessary. They sometimes wasted time trying to retake or edit an image that the AI had rejected.

Photo: A nurse operates the retinal scanner, taking images of the back of a patient’s eye. (Credit: Google)

Because the system had to upload images to the cloud for processing, poor internet connections in several clinics also caused delays. “Patients like the instant results, but the internet is slow and patients then complain,” said one nurse. “They’ve been waiting here since 6 a.m., and for the first two hours we could only screen 10 patients.”

The Google Health team is now working with local medical staff to design new workflows. For example, nurses could be trained to use their own judgment in borderline cases. The model itself could also be tweaked to handle imperfect images better.

Risking a backlash

“This is a crucial study for anybody interested in getting their hands dirty and actually implementing AI solutions in real-world settings,” says Hamid Tizhoosh at the University of Waterloo in Canada, who works on AI for medical imaging. Tizhoosh is very critical of what he sees as a rush to announce new AI tools in response to covid-19. In some cases tools are developed and models released by teams with no health-care expertise, he says. He sees the Google study as a timely reminder that establishing accuracy in a lab is just the first step.

Michael Abramoff, an eye doctor and computer scientist at the University of Iowa Hospitals and Clinics, has been developing an AI for diagnosing retinal disease for several years. He is also CEO of a spinoff startup called IDx Technologies, which has collaborated with IBM Watson. Abramoff has been a cheerleader for health-care AI in the past, but he also cautions against a rush, warning of a backlash if people have bad experiences with AI. “I’m so glad that Google shows they’re willing to look into the actual workflow in clinics,” he says. “There is much more to health care than algorithms.”

Abramoff also questions the usefulness of comparing AI tools with human specialists when it comes to accuracy. Of course, we don’t want an AI to make a bad call. But human doctors disagree all the time, he says—and that’s fine. An AI system needs to fit into a process where sources of uncertainty are discussed rather than simply rejected.

Get it right and the benefits could be huge. When it worked well, Beede and her colleagues saw how the AI made people who were good at their jobs even better. “There was one nurse that screened 1,000 patients on her own, and with this tool she’s unstoppable,” she says. “The patients didn’t really care that it was an AI rather than a human reading their images. They cared more about what their experience was going to be.”