Performance Comparison of Large Language Models for Efficient Literature Screening / Colangelo, M. T.; Guizzardi, S.; Meleti, M.; Calciolari, E.; Galli, C. - In: BIOMEDINFORMATICS. - ISSN 2673-7426. - 5:2 (2025). [doi: 10.3390/biomedinformatics5020025]
Performance Comparison of Large Language Models for Efficient Literature Screening
Colangelo M. T.; Guizzardi S.; Meleti M.; Calciolari E.; Galli C.
2025-01-01
Abstract
Background: Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. Methods: We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. Results: Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. Conclusions: These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews.
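
As a rough illustration of the stratification step described in the abstract, the sketch below embeds candidate PubMed abstracts and the known target RCTs with all-mpnet-base-v2, scores each candidate by its closest target, and splits the collection into similarity quartiles. This is a minimal sketch, not the authors' pipeline: the use of the sentence-transformers library, the variable names, and the choice of maximum cosine similarity as the relevance score are assumptions made for illustration.

    # Minimal sketch (assumptions, not the authors' code) of similarity-based
    # quartile stratification with all-mpnet-base-v2.
    import numpy as np
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-mpnet-base-v2")

    # Placeholder inputs: abstracts of retrieved PubMed records and of the known target RCTs.
    candidate_abstracts = ["...abstract of a retrieved PubMed record...", "..."]
    target_rct_abstracts = ["...abstract of a known target RCT...", "..."]

    # Encode both sets; normalized embeddings make cosine similarity a dot product.
    cand_emb = model.encode(candidate_abstracts, convert_to_tensor=True, normalize_embeddings=True)
    target_emb = model.encode(target_rct_abstracts, convert_to_tensor=True, normalize_embeddings=True)

    # Score each candidate by its highest cosine similarity to any target RCT.
    scores = util.cos_sim(cand_emb, target_emb).max(dim=1).values.cpu().numpy()

    # Split into quartiles of descending relevance: 1 = most similar, 4 = least similar.
    cutoffs = np.quantile(scores, [0.25, 0.50, 0.75])
    quartile = 4 - np.searchsorted(cutoffs, scores)

In this sketch, quartile 1 holds the candidates most similar to the target RCTs, i.e., the set where the abstract reports that false positives increased even for the stronger models.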


