搜索 — ResearchTracker

Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adapt

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

arXiv2026-05-27作者：Yuming, Huang, Yao Liu

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark

搜索结果：would

How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

Less than one percent of words would be affected by gender-inclusive language in German press texts

How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes

Post-treatment problems: What can we say about the effect of a treatment among sub-groups who (would) respond in some way?

Ammonia or methanol would enable subsurface liquid water in the Martian South Pole

Towards using utility data to quantify how investments would have increased the wind resilience of distribution systems

A Robot Expressing Emotions Through Gestures: Everyone Outside of Italy Would Understand this?

Hidden under a warm blanket: If planets existed in protostellar disks, they would hardly produce observable substructures

How much earlier would LSST have discovered currently known long-period comets?

How Many and Which Training Points Would Need to be Removed to Flip this Prediction?

Would Deep Generative Models Amplify Bias in Future Models?

Identifying a would-be terrorist: An ineradicable error in the data processing?

Would Monetary Incentives to COVID-19 vaccination reduce motivation?

Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis

Would Superluminal Influences Violate the Principle of Relativity?

How Frustrated Strings Would Pull the Black Holes from the Centers of Galaxies

To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems?

When would online platforms pay data dividends

"If it didn't happen, why would I change my decision?": How Judges Respond to Counterfactual Explanations for the Public Safety Assessment