Perinatal medication consultation is a core clinical pharmacy service that involves a complex benefit-risk assessment for both maternal and fetal safety. Large language models (LLMs) have emerged as potential tools to improve access to medication information, yet their performance and safety in real-world, pharmacist-led perinatal consultation settings, particularly in non-English contexts, remain insufficiently evaluated. This study aimed to evaluate and compare the performance of multiple advanced LLMs in addressing real-world Chinese perinatal medication consultation queries and to assess their potential role as supervised adjunctive tools within clinical pharmacy services. This cross-sectional study evaluated seven LLMs using real-world clinical data from pharmacist-led medication consultations at the Pharmacy Clinic of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University. A standardized test set of 64 perinatal medication consultation questions was developed from 15,280 electronic consultation records collected between April 2014 and April 2024. The evaluated models included international (GPT-5.1, Grok 3, Gemini 3.0) and domestic (DeepSeek, Wenxin Yiyan, Kimi K2, Tongyi Qianwen) models. Senior clinical pharmacologists independently assessed responses across four dimensions (relevance, accuracy, usefulness, and empathy) using a 10-point Likert scale. Results are reported primarily as median (IQR), with mean ± SD additionally provided as a secondary descriptor to facilitate comparison with prior literature. Among the 448 model-generated responses (64 questions × 7 models), inter-rater consistency was excellent (ICC = 0.91, 95% CI 0.88-0.94). Significant differences in overall performance were observed among the models (Kruskal-Wallis H = 187.4, p < 0.001; ε² = 0.41, large effect).
GPT-5.1 achieved the highest median total score [9.3 (IQR: 8.8-9.6); mean ± SD: 9.1 ± 0.8], outperforming all other models (all Bonferroni-corrected p < 0.01; all r > 0.50, large effect sizes), followed by Kimi K2 [8.5 (IQR: 7.9-9.1); mean ± SD: 8.4 ± 1.2] and DeepSeek [8.3 (IQR: 7.6-8.9); mean ± SD: 8.2 ± 1.1]. Tongyi Qianwen demonstrated the lowest overall performance [6.7 (IQR: 5.9-7.4); mean ± SD: 6.8 ± 1.3]. Accuracy was the primary determinant of performance differences. Performance gaps were more pronounced in complex clinical scenarios involving comorbidities or benefit-risk trade-offs, whereas domestic models demonstrated relative advantages in consultations involving traditional Chinese medicine. LLMs have demonstrated variable performance in response to perinatal medication consultation queries. While high-performing models show potential to support pharmacist-led perinatal medication consultations by improving access to information, their current performance supports use only as supervised, adjunctive decision-support tools rather than independent sources of medication counseling, with human oversight essential prior to broader integration.
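As a rough arithmetic check on the reported effect size, the sketch below recomputes two commonly used effect-size measures for the Kruskal-Wallis test from the reported H statistic. It assumes that the unit of analysis is the 448 individual responses (64 questions × 7 models) and that the seven models form the groups; the abstract does not state which formula the authors used, so the function names and both formulas here are illustrative rather than a reconstruction of their analysis.

```python
# Sketch: effect sizes for a Kruskal-Wallis test, recomputed from values
# reported in the abstract. Assumption (not stated in the source): n is the
# number of responses (448) and k is the number of models (7).

def epsilon_squared(h: float, n: int) -> float:
    """Epsilon-squared for Kruskal-Wallis: H / (n - 1)."""
    return h / (n - 1)


def eta_squared_h(h: float, n: int, k: int) -> float:
    """Eta-squared based on H: (H - k + 1) / (n - k)."""
    return (h - k + 1) / (n - k)


H_STATISTIC = 187.4          # reported Kruskal-Wallis H
K_MODELS = 7                 # seven LLMs compared
N_QUESTIONS = 64             # standardized test set size
N_RESPONSES = K_MODELS * N_QUESTIONS  # 448 model-generated responses

eps2 = epsilon_squared(H_STATISTIC, N_RESPONSES)
eta2 = eta_squared_h(H_STATISTIC, N_RESPONSES, K_MODELS)

print(f"epsilon-squared: {eps2:.3f}")  # 0.419
print(f"eta-squared (H): {eta2:.3f}")  # 0.411
```

Under these assumptions both measures land close to the reported value of 0.41, which is consistent with the "large effect" interpretation whichever of the two formulas was applied.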
PubMed · 2026-05-05