New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology


Journal article


L. Huynh, B. Bonebrake, K. Schultis, A. Quach, C. Deibert
Urology Practice, 2023

Cite

APA
Huynh, L., Bonebrake, B., Schultis, K., Quach, A., & Deibert, C. (2023). New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology. Urology Practice.


Chicago/Turabian
Huynh, L., B. Bonebrake, K. Schultis, A. Quach, and C. Deibert. “New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-Assessment Study Program for Urology.” Urology Practice (2023).


MLA
Huynh, L., et al. “New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-Assessment Study Program for Urology.” Urology Practice, 2023.


BibTeX

@article{huynh2023new,
  title = {New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology},
  year = {2023},
  journal = {Urology Practice},
  author = {Huynh, L. and Bonebrake, B. and Schultis, K. and Quach, A. and Deibert, C.}
}

Abstract

Introduction: Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We evaluated ChatGPT on the American Urological Association Self-assessment Study Program as a potential educational adjunct for urology trainees and practicing physicians.

Methods: One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining 135 items were encoded in open-ended and multiple-choice formats. ChatGPT's output was coded as correct, incorrect, or indeterminate; indeterminate responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.

Results: ChatGPT answered 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions correctly. Indeterminate responses were generated for 40 (29.6%) and 4 (3.0%) questions, respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) came on the initial output, 8 (22.2%) and 1 (2.6%) on the second output, and 4 (11.1%) and 1 (2.6%) on the final output, respectively. Although regeneration reduced the number of indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.

Conclusions: ChatGPT previously demonstrated promise on medical licensing exams; however, comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance was better on multiple-choice than on open-ended questions. More concerning were the persistent justifications offered for incorrect responses; left unchecked, use of ChatGPT in medicine may facilitate medical misinformation.
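
The Methods describe a simple query protocol: each question is posed in a fresh session to avoid crossover learning, the output is graded by human reviewers, and only indeterminate outputs are regenerated, up to 2 times. The sketch below is an approximation of that protocol, not the authors' actual tooling: the study used the ChatGPT web interface with 3 researcher graders and 2 physician adjudicators, whereas the sketch stands in the OpenAI chat completions Python API and a single interactive rater. The model name, the grade_manually helper, and the question dictionaries are hypothetical.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MAX_REGENERATIONS = 2  # per the abstract: indeterminate responses regenerated up to 2 times


def grade_manually(question, answer_text):
    # In the study, 3 independent researchers coded each output and 2 physician
    # adjudicators reviewed; here a single human rater types the verdict.
    print(f"Q{question['id']}: {question['prompt']}")
    print(f"--- ChatGPT ---\n{answer_text}")
    verdict = ""
    while verdict not in {"correct", "incorrect", "indeterminate"}:
        verdict = input("Grade (correct/incorrect/indeterminate): ").strip().lower()
    return verdict


def ask_once(prompt):
    # One single-turn conversation per call, mirroring the "new session for
    # each entry" rule that prevents crossover learning between questions.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for the ChatGPT model evaluated
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def evaluate(question):
    attempts = []
    verdict = "indeterminate"
    for attempt in range(1 + MAX_REGENERATIONS):
        answer = ask_once(question["prompt"])
        verdict = grade_manually(question, answer)
        attempts.append({"attempt": attempt + 1, "answer": answer, "verdict": verdict})
        if verdict != "indeterminate":
            break  # only indeterminate outputs trigger a regeneration
    return {"id": question["id"], "attempts": attempts, "final_verdict": verdict}

Calling evaluate({"id": 1, "prompt": "<question text>"}) for each of the 135 retained items, once in open-ended form and once in multiple-choice form, would produce the per-attempt records behind the tallies reported in the Results.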