An Ensemble of Deep Convolutional Neural Networks is More Accurate and Reliable than Board-certified Ophthalmologists at Detecting Multiple Diseases in Retinal Fundus Photographs
Author Block: Jovi Chau-Yee Wong 1, Prashant U. Pandey2, Brian G. Ballios1, Panos G. Christakis1, Alexander J. Kaplan1, David J. Mathew1, Stephan Ong Tone1, Michael J. Wan1, Jonathan A. Micieli1. 1Department of Ophthalmology and Vision Sciences, University of Toronto, 2School of Biomedical Engineering, University of British Columbia.
Author Disclosure Block: J.C. Wong: None. P.U. Pandey: None. B.G. Ballios: None. P.G. Christakis: None. A.J. Kaplan: None. D.J. Mathew: None. S. Ong Tone: None. M.J. Wan: None. J.A. Micieli: None.
Abstract Title: An Ensemble of Deep Convolutional Neural Networks is More Accurate and Reliable than Board-certified Ophthalmologists at Detecting Multiple Diseases in Retinal Fundus Photographs
Abstract Body: Purpose: To develop an algorithm that classifies common retinal pathologies accurately and reliably from fundus photographs, and to validate its performance against human experts. Study Design: A prospective comparative evaluation of a diagnostic technology against human performance. Methods: We trained a deep convolutional ensemble (DCE), an ensemble of five convolutional neural networks (CNNs), to classify retinal fundus photographs into four classes: diabetic retinopathy (DR), glaucoma, age-related macular degeneration (AMD), and normal. Image data comprised 43,055 fundus images from 12 public datasets spanning these four classes. The CNN architecture was based on the InceptionV3 model, with initial weights pre-trained on the ImageNet dataset. Five trained ensembles were then tested on an ‘unseen’ set of 100 images. Seven board-certified ophthalmologists were asked to classify the same test images. We measured classification performance by accuracy, F1-score, positive predictive value (PPV), sensitivity, and specificity. Reliability was measured as the agreement between the confidence and accuracy of predictions. Results: Board-certified ophthalmologists achieved a mean accuracy of 72.7% (SD: 6.0%) over all classes, while the DCE achieved a greater mean accuracy of 79.2% (SD: 2.3%; p = 0.03). The DCE also achieved greater mean PPV (p = 0.0005), sensitivity (p = 0.03), specificity (p = 0.03), and F1-score (p = 0.02) than the ophthalmologists over all classes. In per-class analysis, the DCE had a statistically significantly higher mean F1-score for DR classification than the ophthalmologists (76.8% vs. 57.5%; p = 0.01), and greater but statistically non-significant mean F1-scores for glaucoma (83.9% vs. 75.7%; p = 0.10), AMD (85.9% vs. 85.2%; p = 0.69), and normal eyes (73.0% vs. 70.5%; p = 0.39).
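The abstract does not specify how the five CNNs' outputs are combined into a single DCE prediction. A minimal sketch of one common choice, soft voting (averaging per-model softmax probabilities over the four classes), is shown below; the function name `ensemble_predict` and the use of the mean top-class probability as the ensemble confidence are illustrative assumptions, not the authors' documented method.

```python
import numpy as np

# Hypothetical soft-voting sketch over the four classes named in the
# abstract. The aggregation rule is an assumption; the DCE's actual
# combination scheme is not described in the abstract.
CLASSES = ["DR", "glaucoma", "AMD", "normal"]

def ensemble_predict(member_probs):
    """Average per-model softmax outputs and return the argmax class
    plus its mean probability (used here as the ensemble confidence).

    member_probs: array-like of shape (n_models, n_classes).
    """
    probs = np.asarray(member_probs, dtype=float)
    mean_probs = probs.mean(axis=0)      # soft vote across the ensemble
    idx = int(mean_probs.argmax())
    return CLASSES[idx], float(mean_probs[idx])

# Example: five models' softmax outputs that mostly favor DR.
votes = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.55, 0.15, 0.15, 0.15],
    [0.40, 0.30, 0.20, 0.10],
    [0.65, 0.10, 0.15, 0.10],
]
label, conf = ensemble_predict(votes)  # label "DR", conf 0.58
```

Averaging probabilities (rather than majority-voting hard labels) preserves a graded confidence score, which is what makes the reliability analysis in the Results possible.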
We also found that the DCE was more reliable than the ophthalmologists, with a greater mean agreement between accuracy and confidence: 81.6% vs. 70.3% (p < 0.001). Conclusions: We developed a deep learning model and found that it classified four categories of fundus images more accurately and more reliably than board-certified ophthalmologists. This work provides proof of principle that an AI algorithm is capable of accurate and reliable recognition of multiple retinal diseases from fundus photographs alone.
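The abstract reports reliability as "agreement between accuracy and confidence" without giving a formula. One plausible reading, assumed here, is the fraction of predictions where correctness and confidence match: confident-and-correct or unconfident-and-incorrect. The sketch below implements that reading; the function name `agreement` and the binary treatment of confidence are illustrative assumptions.

```python
import numpy as np

# Hypothetical confidence-accuracy agreement metric. Each prediction is
# flagged as correct/incorrect and confident/unconfident; the metric is
# the fraction of predictions where the two flags agree.
def agreement(correct, confident):
    correct = np.asarray(correct, dtype=bool)
    confident = np.asarray(confident, dtype=bool)
    return float((correct == confident).mean())

# Example: confidence matches correctness on 4 of 5 predictions.
score = agreement([True, True, False, False, True],
                  [True, True, False, True,  True])  # 0.8
```

Under this reading, a perfectly calibrated-in-the-binary-sense classifier scores 1.0: it is confident exactly when it is right, which is the sense in which the DCE's 81.6% exceeds the ophthalmologists' 70.3%.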