An Ensemble of Deep Convolutional Neural Networks is More Accurate and Reliable than Board-certified Ophthalmologists at Detecting Multiple Diseases in Retinal Fundus Photographs

Paper Presentation
5:05 PM, Friday 16 Jun 2023 (3 minutes)
Québec City Convention Centre - Room 307 AB

Authors: Jovi Chau-Yee Wong¹, Prashant U. Pandey², Brian G. Ballios¹, Panos G. Christakis¹, Alexander J. Kaplan¹, David J. Mathew¹, Stephan Ong Tone¹, Michael J. Wan¹, Jonathan A. Micieli¹. ¹Department of Ophthalmology and Vision Sciences, University of Toronto; ²School of Biomedical Engineering, University of British Columbia.

Author Disclosures: J.C. Wong: None. P.U. Pandey: None. B.G. Ballios: None. P.G. Christakis: None. A.J. Kaplan: None. D.J. Mathew: None. S. Ong Tone: None. M.J. Wan: None. J.A. Micieli: None.


Abstract Body:

Purpose: To develop an algorithm that classifies common retinal pathologies accurately and reliably from fundus photographs, and to validate its performance against human experts.

Study Design: We performed a prospective comparative evaluation of a diagnostic technology against human expert performance.

Methods: We trained a deep convolutional ensemble (DCE), an ensemble of five convolutional neural networks (CNNs), to classify retinal fundus photographs into four classes. Image data comprised 43,055 fundus images from 12 public datasets, consisting of samples of diabetic retinopathy (DR), glaucoma, age-related macular degeneration (AMD), and normal eyes. The CNN architecture was based on the InceptionV3 model, with initial weights pre-trained on the ImageNet dataset. Five trained ensembles were then tested on an 'unseen' set of 100 images, and seven board-certified ophthalmologists were asked to classify the same test images. Classification performance was measured through accuracy, F1-score, positive predictive value (PPV), sensitivity, and specificity; reliability was measured through the agreement between the confidence and the accuracy of predictions.
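The ensemble step described above can be sketched as soft voting: average the per-class softmax outputs of the five member CNNs and take the winning class. This is a minimal illustration, assuming a soft-voting rule and the label order shown; the abstract does not specify the authors' exact combination method.

```python
import numpy as np

CLASSES = ["DR", "glaucoma", "AMD", "normal"]  # assumed label order

def ensemble_predict(member_probs: np.ndarray) -> tuple[int, float]:
    """Combine ensemble members by averaging their softmax outputs.

    member_probs: shape (n_members, n_classes); each row is one CNN's
    softmax distribution for a single fundus image.
    Returns (predicted class index, ensemble confidence).
    """
    mean_probs = member_probs.mean(axis=0)  # soft-voting average
    pred = int(mean_probs.argmax())         # winning class
    return pred, float(mean_probs[pred])    # confidence = mean probability

# Example: five members scoring one image across the four classes
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.55, 0.15, 0.20, 0.10],
    [0.80, 0.05, 0.05, 0.10],
    [0.65, 0.15, 0.10, 0.10],
])
pred, conf = ensemble_predict(probs)
print(CLASSES[pred], round(conf, 2))  # DR 0.66
```

Averaging probabilities rather than taking a majority vote also yields a natural per-image confidence score, which the reliability analysis below needs.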

Results: Board-certified ophthalmologists achieved a mean accuracy of 72.7% (SD: 6.0%) over all classes, while the DCE achieved a greater mean accuracy of 79.2% (SD: 2.3%, p = 0.03). The DCE also achieved a greater mean PPV (p = 0.0005), sensitivity (p = 0.03), specificity (p = 0.03), and F1-score (p = 0.02) than the ophthalmologists over all classes. In the per-class analysis, the DCE had a statistically significantly higher mean F1-score for DR classification than the ophthalmologists (76.8% vs. 57.5%; p = 0.01), and greater but statistically non-significant mean F1-scores for glaucoma (83.9% vs. 75.7%; p = 0.10), AMD (85.9% vs. 85.2%; p = 0.69), and normal eyes (73.0% vs. 70.5%; p = 0.39). We also found that the DCE had better reliability than the ophthalmologists, with a greater mean agreement between accuracy and confidence of 81.6% vs. 70.3% (p < 0.001).
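The reliability metric above, agreement between confidence and accuracy, can be sketched as the fraction of cases where a confident prediction is correct or an unconfident prediction is incorrect. The 0.5 confidence threshold and the binary scoring rule here are assumptions for illustration; the abstract does not state the exact definition used.

```python
import numpy as np

def confidence_accuracy_agreement(confidences, correct, threshold=0.5):
    """Fraction of predictions where confidence and accuracy agree.

    A case agrees when the predictor is confident (>= threshold) and
    correct, or unconfident (< threshold) and incorrect.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    confident = confidences >= threshold
    return float(np.mean(confident == correct))

# Toy example: four predictions with their confidences and outcomes
agreement = confidence_accuracy_agreement(
    confidences=[0.9, 0.4, 0.8, 0.3],
    correct=[True, False, False, True],
)
print(agreement)  # 0.5 — the first two cases agree, the last two do not
```

Under this definition, a well-calibrated classifier scores high because it is confident exactly when it is right, which is the sense in which the DCE's 81.6% exceeds the ophthalmologists' 70.3%.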

Conclusions: We developed a deep learning model that classified four categories of fundus images more accurately and more reliably than board-certified ophthalmologists. This work provides proof of principle that an AI algorithm is capable of accurate and reliable recognition of multiple retinal diseases from fundus photographs alone.
