The latest version of ChatGPT has passed a radiology board-style exam, highlighting the potential of large language models in medical contexts while also revealing limitations that hinder reliability, according to two research studies published in Radiology.
ChatGPT is an AI chatbot that uses a deep learning model to recognize patterns and relationships among words in its vast training data, generating human-like responses to prompts. Because its training data contain no definitive source of truth, however, it can produce responses that are factually incorrect.
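For readers who want to experiment, the sketch below shows one way to pose a question to a chat model programmatically. It is a minimal example assuming the OpenAI Python client (v1 interface) and an API key in the environment; the model name and prompt are illustrative, and, as noted above, any answer must be independently verified.

```python
# Minimal sketch: querying a chat model via the OpenAI Python client (v1).
# Assumes OPENAI_API_KEY is set in the environment; the model name and
# prompt are illustrative. Answers can be fluent yet factually wrong.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": (
                "A 45-year-old patient has a 2 cm hyperechoic liver lesion "
                "on ultrasound. What is the most likely diagnosis?"
            ),
        }
    ],
)
print(response.choices[0].message.content)
```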
According to lead author Rajesh Bhayana, M.D., FRCPC, an abdominal radiologist and technology lead at University Medical Imaging Toronto, Toronto General Hospital in Canada, “The use of large language models like ChatGPT is rapidly increasing and will continue to do so. Our research sheds light on ChatGPT’s performance in the field of radiology, showcasing its immense potential as well as the current limitations that undermine its reliability.”
Dr. Bhayana pointed out that ChatGPT recently became the fastest-growing consumer application in history. Similar chatbots are being integrated into popular search engines such as Google and Bing, which physicians and patients use for medical information searches.
To evaluate ChatGPT’s performance on radiology board exam questions and explore its strengths and limitations, Dr. Bhayana and his colleagues initially tested GPT-3.5, the most widely used version of ChatGPT.
The researchers used 150 multiple-choice questions designed to match the style, content, and difficulty of the Canadian Royal College and American Board of Radiology exams. The questions did not include images and were categorized by question type to gain insights into performance: lower-order thinking (knowledge recall, basic understanding) and higher-order thinking (application, analysis, synthesis).
The higher-order thinking questions were further classified by type, including description of imaging findings, clinical management, application of concepts, calculation and classification, and disease associations.
The researchers assessed the overall performance of ChatGPT based on GPT-3.5, as well as its performance by question type and topic, and also evaluated the confidence of the language used in its responses.
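A tally of this kind can be scripted in a few lines. The following is a hypothetical sketch, not the study's code: the record layout and field names are invented for illustration, with accuracy computed overall and per category.

```python
# Hypothetical scoring sketch for a categorized multiple-choice exam.
# The record layout is invented for illustration; the study's actual
# grading pipeline is not described beyond the categories themselves.
from collections import defaultdict

questions = [
    {"level": "lower-order", "subtype": None,
     "correct_answer": "B", "model_answer": "B"},
    {"level": "higher-order", "subtype": "imaging findings",
     "correct_answer": "C", "model_answer": "A"},
    # ... 150 questions in the study's setup
]

tally = defaultdict(lambda: [0, 0])  # category -> [correct, attempted]
for q in questions:
    for category in ("overall", q["level"], q["subtype"]):
        if category is None:
            continue
        tally[category][1] += 1
        tally[category][0] += q["model_answer"] == q["correct_answer"]

for category, (correct, attempted) in tally.items():
    print(f"{category}: {correct}/{attempted} ({correct / attempted:.0%})")
```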
The findings revealed that ChatGPT based on GPT-3.5 correctly answered 69% (104 out of 150) of the questions, which is close to the passing grade of 70% set by the Royal College in Canada. The model demonstrated relatively good performance on lower-order thinking questions, answering 84% (51 out of 61) of them correctly, but struggled with higher-order thinking questions, achieving a 60% (53 out of 89) accuracy rate.
In particular, the model encountered difficulties with higher-order thinking questions involving the description of imaging findings (61%, 28 out of 46), calculation and classification (25%, 2 out of 8), and application of concepts (30%, 3 out of 10). Its weaker performance on higher-order thinking questions was expected due to the lack of radiology-specific pretraining.
GPT-4 was released in March 2023 in a limited capacity to paid users, with OpenAI claiming improved advanced reasoning capabilities over GPT-3.5.
In a subsequent study, GPT-4 correctly answered 81% (121 out of 150) of the same questions, surpassing GPT-3.5 and exceeding the passing threshold of 70%. GPT-4 performed significantly better than GPT-3.5 on higher-order thinking questions, particularly those related to the description of imaging findings (85%) and application of concepts (90%).
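Because both models answered the same 150 questions, the paired right/wrong outcomes lend themselves to McNemar's test, a common way to compare two classifiers head to head. The sketch below is an assumption about how such a comparison could be run, not the study's published analysis; the marginal totals match the reported scores (104/150 and 121/150), but the split between concordant and discordant pairs is a placeholder.

```python
# Assumed analysis sketch: McNemar's test on paired per-question outcomes.
# Row/column marginals match the reported totals (GPT-3.5: 104/150,
# GPT-4: 121/150), but the concordant/discordant split is a placeholder,
# since the paired per-question data were not published in this summary.
from statsmodels.stats.contingency_tables import mcnemar

#                 GPT-4 correct   GPT-4 wrong
# GPT-3.5 correct      100             4
# GPT-3.5 wrong         21            25
table = [[100, 4],
         [21, 25]]

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic:.0f}, p={result.pvalue:.4f}")
```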
These findings suggest that GPT-4's claimed improvements in advanced reasoning translate to enhanced performance in the field of radiology. They also indicate an improved contextual understanding of radiology-specific terminology, including imaging descriptions, which is crucial for future downstream applications.