Assessing the Medical Reasoning Skills of GPT-4 in Complex Ophthalmology Cases
Abstract
BACKGROUND/AIMS
This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.
METHODS
We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, and prompted the model to determine the diagnosis (open-ended question) and identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the reasoning of the model (an illustrative sketch of these prompting strategies follows the abstract). We compared the best-performing model to human graders in a benchmarking effort.
RESULTS
Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) and 63.0% (95% CI 58.2% to 67.6%) in diagnosis and next step, respectively. Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI 68.6% to 80.9%) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and p=0.049) and in accuracy of next step (p=0.002 and p=0.020).
CONCLUSION
Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
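As a point of reference for readers unfamiliar with the prompting strategies described in the METHODS, below is a minimal sketch, written in Python against the OpenAI chat-completions API, of how a plain zero-shot prompt and a zero-shot plan-and-solve+ (PS+) prompt could be issued for a single case. The model name, temperature, and exact PS+ wording are illustrative assumptions adapted from the plan-and-solve literature; the study's actual prompts and settings may differ.

# Minimal sketch (not the authors' code) of zero-shot vs zero-shot PS+ prompting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Plain zero-shot instruction: ask for the diagnosis directly.
ZERO_SHOT = "What is the most likely diagnosis? Answer concisely."

# Zero-shot plan-and-solve+ (PS+) instruction: extract findings, devise a plan,
# then carry it out step by step before committing to an answer.
PS_PLUS = (
    "Let's first understand the case and extract the relevant clinical findings. "
    "Then, let's devise a complete plan to reach the diagnosis. "
    "Finally, let's carry out the plan, reason step by step, and state the single "
    "most likely diagnosis."
)

def ask(case_text: str, instruction: str, model: str = "gpt-4") -> str:
    """Send one clinical vignette plus a prompting instruction and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep outputs stable for evaluation
        messages=[
            {"role": "system", "content": "You are an ophthalmologist."},  # role-play framing
            {"role": "user", "content": f"{case_text}\n\n{instruction}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    case = "A 67-year-old presents with sudden, painless vision loss in the right eye ..."  # placeholder vignette
    print(ask(case, ZERO_SHOT))
    print(ask(case, PS_PLUS))

The same pattern extends to the multiple-choice next-step question by appending the answer options to the user message.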
Additional Info
Disclosure statements are available on the authors' profiles.
Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases
Br J Ophthalmol 2024 Sep 20;108(10):1398-1405. D Milad, F Antaki, J Milad, A Farah, T Khairy, D Mikhail, CÉ Giguère, S Touma, A Bernstein, AA Szigiato, T Nayman, GA Mullie, R Duval.
From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine.
A work in progress: Toward large language model–assisted ophthalmic decision-making
We read with great interest the study by Milad et al on GPT-4's ability to reason through published ophthalmic cases. The authors used zero-shot prompting and plan-and-solve+ prompt engineering to guide GPT-4 through cases from the Journal of the American Medical Association (JAMA) Ophthalmology Clinical Challenge section and compared GPT-4's performance on diagnostic and next-step tasks with that of ophthalmologists at various stages of training. Their key findings were: 1) although GPT-4 demonstrated moderate performance in complex cases, its out-of-the-box performance did not exceed that of ophthalmologists; and 2) prompt engineering improved the performance of GPT-4 on various reasoning tasks.
This study highlights the importance of prompt engineering, both through zero-shot prompting (ie, role-play prompting) and through a plan-and-solve approach. Prompt engineering, the process of designing and refining input prompts to optimize a large language model's (LLM's) response, has previously been shown to improve the performance of LLMs in various conversational and diagnostic question-and-answer tasks.1 This study addresses a key gap in knowledge regarding how LLMs perform on clinical reasoning tasks (ie, deciding what to do next), a core component of diagnosis and management.
LLMs have the potential to play a significant role in assisting ophthalmologists with disease diagnosis and management. However, it is unsurprising that, in this study, the performance of GPT-4 did not exceed that of ophthalmologists despite the use of prompt engineering, as substantial literature is unlikely to exist for many of the rare entities published in JAMA Clinical Challenges. Moreover, as the authors noted, it is unclear whether this dataset was included in the training of GPT-4, which would confound the analysis. The development, fine-tuning, or vector-based retrieval augmentation of an ophthalmology-specific GPT, along with the integration of imaging, may improve its performance in diagnostic and inferential tasks relative to that of an ophthalmologist and may also decrease "hallucinations." More work is needed to understand how LLMs can be improved to better assist and extend the ability of ophthalmologists to deliver quality eye care.
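As a concrete illustration of the vector-based retrieval augmentation suggested above, the sketch below embeds a small, hypothetical corpus of ophthalmology reference snippets, retrieves the passages most similar to a case, and prepends them to the prompt so that the model's answer is grounded in domain text. The corpus, embedding model, and chat model named here are assumptions for illustration and are not drawn from the study or this letter.

# Minimal retrieval-augmented prompting sketch (illustrative only).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical reference snippets; a real system would index textbooks and guidelines.
CORPUS = [
    "Acute angle-closure glaucoma: painful vision loss, halos, mid-dilated pupil, high intraocular pressure.",
    "Central retinal artery occlusion: sudden painless monocular vision loss with a cherry-red spot.",
    "Vogt-Koyanagi-Harada disease: bilateral granulomatous panuveitis with serous retinal detachments.",
]

def embed(texts: list[str]) -> np.ndarray:
    """Return unit-normalised embeddings so dot products equal cosine similarity."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def answer_with_context(case_text: str, k: int = 2) -> str:
    """Retrieve the k most similar reference snippets and prepend them to the prompt."""
    corpus_vecs = embed(CORPUS)
    case_vec = embed([case_text])[0]
    top_k = np.argsort(corpus_vecs @ case_vec)[::-1][:k]  # rank by cosine similarity
    context = "\n".join(CORPUS[i] for i in top_k)
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an ophthalmologist. Use the reference notes when relevant."},
            {"role": "user", "content": f"Reference notes:\n{context}\n\nCase:\n{case_text}\n\nWhat is the most likely diagnosis?"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(answer_with_context("Sudden painless loss of vision in the left eye; fundus examination shows a cherry-red spot."))

A production system would index textbooks, guidelines, and imaging reports rather than a three-item list, but the retrieve-then-prompt pattern would be the same.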
Reference