Search - ResearchTracker

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice

arXiv, 2025-02-28. Authors: Shuyu Liu, Ruoxi Wang, Ling Zhang

The advent of Large Language Models (LLMs) offers potential solutions to address problems such as shortage of medical resources and low diagnostic consistency in psychiatric clinical practice. Despite this potential, a robust and comprehensive benchmarking framework to assess the efficacy of LLMs in authentic psychiatric clinical environments is absent. This has impeded the advancement of specialized LLMs tailored to psychiatric applications. In response to this gap, by incorporating clinical demands in psychiatry and clinical data, we proposed a benchmarking system, PsychBench, to evaluate the practical performance of LLMs in psychiatric clinical settings. We conducted a comprehensive quantitative evaluation of 16 LLMs using PsychBench, and investigated the impact of prompt design, chain-of-thought reasoning, input text length, and domain-specific knowledge fine-tuning on model performance. Through detailed error analysis, we identified strengths and potential limitations of the existing models and suggested directions for improvement. Subsequently, a clinical reader study involving 60 psychiatrists of varying seniority was conducted to further explore the practical benefits of ex

Search results: Neurology. Clinical practice

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice

Embedding computational neurorehabilitation in clinical practice using a modular intelligent health system

Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework

From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Association between quality of clinical practice guidelines and citations given to their references

A Governance and Evaluation Framework for Deterministic, Rule-Based Clinical Decision Support in Empiric Antibiotic Prescribing

Clinical ModernBERT: An efficient and long context encoder for biomedical text

SoftTiger: A Clinical Foundation Model for Healthcare Workflows

From Clinical Intent to Clinical Model: Autonomous Coding-Agents for Clinician-driven AI Development

Health System Scale Semantic Search Across Unstructured Clinical Notes

AI Driven Knowledge Extraction from Clinical Practice Guidelines: Turning Research into Practice

Evaluation of Galaxy as a User-friendly Bioinformatics Tool for Enhancing Clinical Diagnostics in Genetics Laboratories

AI-based Clinical Decision Support for Primary Care: A Real-World Study

Design for a Digital Twin in Clinical Patient Care

A review of handcrafted and deep radiomics in neurological diseases: transitioning from oncology to clinical neuroimaging

Judge-dependent safety gains and model-specific helpfulness costs of evidence-sufficiency prompting in clinical LLMs

Adapting Abstract Meaning Representation Parsing to the Clinical Narrative -- the SPRING THYME parser

On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI