Artificial intelligence performs better than doctors, studies find

The AI tool GPT-4 is better at diagnosing eye conditions than most doctors, and better at clinical reasoning than many doctors, according to new research

23rd April 2024 about a 4 minute read

Artificial intelligence (AI) has been found to perform better than doctors in two separate studies.

In the first study, University of Cambridge researchers compared the ability of an AI model to diagnose eye problems with that of doctors.

The model, GPT-4, was tested against doctors at different stages in their careers, including non-specialist junior doctors, trainee eye doctors and expert eye doctors (ophthalmologists). Each was presented with a series of 87 patient scenarios involving a specific eye problem, and asked to give a diagnosis or advise on treatment by selecting from four options.

The test included questions on a wide range of topics, including extreme light sensitivity, decreased vision, lesions, and itchy and painful eyes, taken from a textbook used to test trainee eye doctors.

GPT-4, which powers the online chatbot ChatGPT to provide bespoke responses to human queries, scored significantly better in the test than the non-specialist junior doctors, and achieved similar scores to trainee and expert eye doctors, although the top performing doctors scored higher. The study is published in PLOS Digital Health.

The researchers said that large language models are unlikely to replace healthcare professionals, but have the potential to improve healthcare as part of the clinical workflow.

AI could be deployed to triage patients

They said that language models like GPT-4 could be useful for providing eye-related advice, diagnosis and management suggestions in well-controlled contexts, like triaging patients, or where access to specialist healthcare professionals is limited.

“We could realistically deploy AI in triaging patients with eye issues to decide which cases are emergencies that need to be seen by a specialist immediately, which can be seen by a GP, and which don’t need treatment,” said Dr Arun Thirunavukarasu, lead author of the study. Thirunavukarasu, formerly a University of Cambridge student and now a doctor at Oxford University Hospitals NHS Foundation Trust, added: “The models could follow clear algorithms already in use, and we’ve found that GPT-4 is as good as expert clinicians at processing eye symptoms and signs to answer more complicated questions.”

He said that it was important to “characterise the capabilities and limitations of commercially available models, as patients may already be using them – rather than the internet – for advice.”

Thirunavukarasu said, however, that he thought doctors would continue to be in charge of patient care: “The most important thing is to empower patients to decide whether they want computer systems to be involved or not. That will be an individual decision for each patient to make.”

The researchers said that their study is superior to similar, previous studies because they compared the abilities of AI with those of practising doctors, rather than with sets of examination results.

AI tested with simulated clinical cases

In another study carried out by researchers at Harvard University, the same tool, GPT-4, performed better on clinical reasoning than two groups of doctors – residents and attending physicians. GPT-4 had more instances of incorrect reasoning than the doctors did but scored better overall.

The research involved 39 doctors from two academic medical centres in Boston, who were presented with 20 simulated clinical cases involving common problems such as pharyngitis, headache, abdominal pain, cough, and chest pain. Each case included sections describing the triage presentation, review of systems, physical examination and diagnostic testing.

The results were assessed using the Revised-IDEA (R-IDEA) score, a 10-point scale evaluating clinical reasoning documentation across four domains: interpretive summary, differential diagnosis, explanation of the lead diagnosis, and explanation of alternative diagnoses.

The AI tool achieved a median R-IDEA score of 10, higher than attending physicians (median score, 9) and residents (8).

However, AI provided more responses that contained instances of incorrect clinical reasoning (13.8%) than residents (2.8%) and attending physicians (12.5%).

FCC Insight

These two studies show how rapidly AI is progressing. The tool used, GPT-4, can now perform as effectively as most doctors in both general and specialist contexts. We must be cautious, however. Performing well in a research setting is not the same as performing well in day-to-day clinical practice. And in the study on eye diseases, the best of the specialist doctors still fared better than the AI tool. We agree with Dr Thirunavukarasu that the main use of AI will be in triage, to assess which patients need emergency treatment. This could create significant efficiencies for the NHS and speed up patients’ access to the specialist care they need.