Ophthalmology clinical decision support using multimodal large language models with role and chain-of-thought prompting: a comparative observational study - 5634


When:
5:07 PM, Friday, June 20, 2025 (5 minutes)
Author’s Name(s): Stuti Misty Tanya, Raj Pathak, Raheem Remtulla, Anne Nguyen, Merve Kulbay, Guillermo Rocha, Femida Kherani

Author’s Disclosure Block: Stuti Misty Tanya: none; Raj Pathak: none; Raheem Remtulla: none; Anne Nguyen: none; Merve Kulbay: none; Guillermo Rocha: none; Femida Kherani: none 

Abstract Body

Purpose: Prompt engineering is the practice of curating the inputs provided to large language models (LLMs) to elicit more accurate and relevant outputs for the user's intended purpose. We applied role prompting (RP) and chain-of-thought prompting (CoTP) to three popular multimodal LLMs (mLLMs), Claude 3.5 Sonnet (Claude), ChatGPT-4o (GPT), and Gemini Advanced 1.5 Pro (Gemini), and assessed their capabilities for ophthalmology clinical decision support.

Study Design: Comparative observational study.

Methods: One input was created with RP and a second input was created with RP and CoTP, along with instructions to address each case in four steps: differential diagnosis, diagnostic tests, medical treatments, and surgical treatments. The inference temperature was set to zero for deterministic responses. Thirty cases across ten ophthalmology domains were phrased to emulate an ophthalmology resident's interpretation; 47% of cases were submitted with images. One physician score and three mLLM-as-judge scores (one per mLLM) were generated for each RP and RP-CoTP response. Responses were scored for appropriateness according to a 7-point (0 to 6) Likert-based scale (1).

Results: Based on physician feedback, RP and RP-CoTP total scores were 5.30 versus 5.33 (Δ0.03) for Claude, 4.80 versus 5.27 (Δ0.47) for GPT, and 4.53 versus 4.40 (Δ-0.13) for Gemini. RP versus RP-CoTP response scores by Claude were 5.67 and 5.70 (Δ0.03) for Claude, 5.67 and 5.37 (Δ-0.30) for GPT, and 5.83 and 5.73 (Δ-0.10) for Gemini, respectively. RP versus RP-CoTP response scores by GPT were 5.50 and 5.33 (Δ-0.17) for Claude, 5.27 and 4.90 (Δ-0.37) for GPT, and 5.40 and 5.17 (Δ-0.23) for Gemini, respectively. RP versus RP-CoTP response scores by Gemini were 5.63 and 5.33 (Δ-0.30) for Claude, 5.43 and 5.13 (Δ-0.30) for GPT, and 5.57 and 5.17 (Δ-0.40) for Gemini, respectively. RP and RP-CoTP scores for Claude on cases with images were 5.29 and 5.21, compared to 5.31 and 5.44 without images; scores for GPT on cases with images were 4.93 and 5.07, compared to 4.69 and 5.44 without images; and scores for Gemini on cases with images were 4.86 and 4.64, compared to 4.25 and 4.19 without images.

Conclusions: mLLMs perform strongly for clinical decision support, with Claude demonstrating the highest scores based on physician feedback. Based on physician feedback, RP-CoTP performed better than RP alone; however, based on LLM-as-judge feedback, RP alone performed better than RP-CoTP. Both prompting approaches for Claude and Gemini performed better without images, whereas GPT with RP performed better with images but GPT with RP-CoTP performed better without images. Refining prompt complexity and token efficiency to reduce cognitive load may help increase the performance of prompt-engineered mLLMs for ophthalmology clinical decision support.
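
For illustration, a minimal sketch of how the two prompt conditions and the mLLM-as-judge scoring could be assembled is shown below. The role text, chain-of-thought wording, example case, judge rubric, and the query_model stub are hypothetical placeholders under the assumptions described in the Methods (temperature set to zero, four-step case structure, 0 to 6 appropriateness scale); they are not the study's actual prompts or code.

# Hypothetical sketch: RP and RP-CoTP prompt construction plus an mLLM-as-judge rubric.
# query_model stands in for whichever provider SDK is used (Claude, GPT, or Gemini);
# it is called with temperature=0, matching the deterministic setting in the Methods.

ROLE = "You are a staff ophthalmologist providing clinical decision support."  # assumed role text

COT_STEPS = (
    "Reason step by step and address the case in four steps: "
    "1) differential diagnosis, 2) diagnostic tests, "
    "3) medical treatments, 4) surgical treatments."
)

def build_rp_prompt(case_text: str) -> str:
    """Role prompting only."""
    return f"{ROLE}\n\nCase: {case_text}"

def build_rp_cotp_prompt(case_text: str) -> str:
    """Role prompting plus chain-of-thought instructions."""
    return f"{ROLE}\n\n{COT_STEPS}\n\nCase: {case_text}"

def build_judge_prompt(case_text: str, response: str) -> str:
    """mLLM-as-judge prompt using a 7-point (0 to 6) appropriateness scale."""
    return (
        f"{ROLE}\n\nCase: {case_text}\n\nCandidate response:\n{response}\n\n"
        "Rate the appropriateness of the candidate response on a 0-6 Likert scale, "
        "where 0 is completely inappropriate and 6 is completely appropriate. "
        "Reply with the integer score only."
    )

def query_model(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a real mLLM API call; returns a canned string so the sketch runs."""
    return "3"

if __name__ == "__main__":
    # Hypothetical case text, phrased the way a resident might summarize it.
    case = ("62-year-old with sudden painless monocular vision loss and a "
            "relative afferent pupillary defect.")
    rp_answer = query_model(build_rp_prompt(case), temperature=0.0)
    cotp_answer = query_model(build_rp_cotp_prompt(case), temperature=0.0)
    judge_score = query_model(build_judge_prompt(case, rp_answer), temperature=0.0)
    print(rp_answer, cotp_answer, judge_score)

In this sketch, setting the temperature to zero makes repeated runs of the same prompt as reproducible as the provider allows, which is the rationale given in the Methods for deterministic responses.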

Tanya (Misty) Stuti

Speaker
