Australian researchers test performance of ChatGPT on optometry exam questions

Scientists find the latest version of the AI tool “excelled” across a range of optometry and vision science written questions

Selina Powell

06 August 2025

Researchers from the University of New South Wales have described the performance of a large language model (LLM) across a variety of optometry and vision science written response questions.

Writing in Ophthalmic and Physiological Optics, scientists highlighted that earlier models of ChatGPT (GPT-3.5 and GPT-4) demonstrated “variable but generally passable performance” across the set of sample questions – which included past written exam questions.

The latest version of ChatGPT (o1) “excelled across all questions,” the authors noted.

“The results of the study have shown that LLMs are able to generate satisfactory responses to various assessment questions in the field of optometry and vision science, and in many cases excel at these,” the researchers highlighted.

“Subsequent models showed significantly greater capabilities over preceding models,” they added.

The authors also assessed the performance of ChatGPT as a grader of written questions, by exploring the concordance between the AI tool and a human grader.

They found that while ChatGPT graders generally awarded higher marks than human graders, this was only statistically significant for GPT-3.5.

“The result of the study suggests there is an urgent need for optometry and vision science educators to adopt new learning and teaching strategies in the ‘ChatGPT-era’,” the researchers stated.

Comments (1)

Don Williams, 13 August 2025

This is a valuable and timely study. It confirms what many of us have observed at the coalface: model quality has moved from variable to consistently strong, with o1 now producing coherent, clinically plausible written answers to optometry and vision science questions. That is encouraging for educators who want richer explanations, rapid feedback and new ways to support learning.

The caution is that written response performance is not the same as safe clinical reasoning. Real practice involves incomplete data, image interpretation, atypical presentations and choices under uncertainty. There is also a prompt effect. The model’s best work often reflects a well-structured prompt. If the authors used excellent prompts, results may overstate what an average student or educator will obtain. Uneven prompt skill can introduce equity issues and inflate perceived competence.

For education, the response should be pragmatic. Redesign assessments to reveal thinking and judgement through viva-style questioning, data interpretation with OCT and visual fields, and supervised case work. If AI is used for formative marking, keep human moderation, clear rubrics and version transparency. Teach students to critique model outputs against primary sources and local policy, disclose any AI assistance and show their reasoning.

In short, this is good news for teaching and learning. Treat the models as accelerators for explanation and practice, not as substitutes for expertise or accountability.
