Agentic artificial intelligence (AI) systems, characterized by autonomous goal-directed behavior, multi-step reasoning, task decomposition, and tool use, are increasingly proposed for healthcare applications. However, their autonomy raises concerns regarding transparency, accountability, and human oversight. While explainable AI (XAI) has been widely studied in traditional predictive models, less is known about how explainability is implemented within agentic architectures. This review aimed to map the emerging literature on explainable agentic AI (XAAI) in healthcare and to characterize the types, scope, and forms of explainability used in these systems. A scoping review was conducted following PRISMA-ScR guidelines. PubMed, Embase, IEEE Xplore, and ACM Digital Library were searched through November 2025. Eligible studies described healthcare-related agentic AI systems incorporating explicit explainability mechanisms. Data were extracted on system architecture, explainability type (intrinsic, post hoc, hybrid), explanation scope (local, global), explanation form, and reported clinical outcomes. Nine studies met the inclusion criteria. All systems demonstrated core agentic features, including autonomy, task decomposition, and tool integration, often within multi-agent frameworks. Explainability was predominantly intrinsic and workflow-native, typically delivered through textual reasoning traces and example-based grounding in retrieved clinical evidence. Feature-based and global explanations were comparatively rare and largely confined to hybrid architectures. Across domains including radiology, neurology, psychiatry, and biomedical research, XAAI systems were reported to improve performance and interpretability relative to baseline models in the included studies. However, these findings were derived from heterogeneous, predominantly experimental or retrospective studies, and structured human-in-the-loop oversight was infrequently described.
Current XAAI systems appear to emphasize process transparency and evidence grounding rather than mechanistic model-level attribution. The available evidence remains limited and heterogeneous, and findings should be interpreted as early trends rather than established characteristics. Further progress will require standardized evaluation frameworks, clearer reporting of oversight mechanisms, and validation in real-world clinical settings to support safe and trustworthy integration of agentic AI into healthcare practice.
As agentic artificial intelligence systems become increasingly embedded in medical imaging, practice is moving from episodic decision support to workflow-based architectures that alter how practitioners think and practise. Medical imaging practice is traditionally conceptualised using Dual Process Theory, which describes how practitioners use their System 1 (intuitive decision making) and System 2 (analytic decision making) in practice. However, as more practitioners incorporate agentic artificial intelligence systems into their workflow, a Tri-System framework may be required. This Perspective paper will show how the practitioner and an agentic artificial intelligence system become part of a cognitive team known as System 3. It will argue that an appropriate level of cognitive surrender should be considered and that current decision making should be reframed through diagnostic complementarity, with added emphasis on structured human and AI interaction to achieve optimal performance. We recommend the implementation of the following educational methods in radiography programmes: (a) training students using fault-injected medical images to reinforce the importance of human verification in image interpretation; (b) preparing students to supervise the performance of agentic artificial intelligence systems; (c) normalising AI-assisted activities to mitigate potential deskilling.
Artificial intelligence (AI) is shifting from passive, user-initiated tools to proactive agentic AI systems that are capable of autonomous, multi-step actions. These agents can independently gather information, execute sequential tasks, and collaborate with humans or other agents without requiring constant prompting from humans. Early adopters in health care have demonstrated feasibility across multiple specialties and clinical settings. Dermatology is well-positioned to benefit given its high patient volumes, administrative burdens, and clinicopathological workflows. To guide responsible adoption of agentic AI, we propose a risk-stratification framework based on clinical risk and task reversibility. Barriers to widespread adoption of agentic AI include limitations in model reliability, interoperability across health records, and unresolved questions around liability, privacy, and regulation. Dermatologists must proactively engage via professional organizations and industry partnerships to ensure that agentic AI is developed safely, equitably, and in alignment with our values.
Agentic AI systems integrate foundation models, prompt templates, tool connectors, orchestration logic, and containerised dependencies, creating exploitability conditions that cannot be inferred from static Software Bills of Materials (SBOMs). Artificial Intelligence Bills of Materials (AIBOM) extend transparency to AI-specific artefacts, yet current CSAF/VEX workflows remain based on static component-CVE correlation without runtime validation. A protocol-driven framework is presented that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry. Exploitability is computed from declared artefacts, observed activation conditions, and enforced execution policies. CSAF-VEX advisories are generated from combined static and runtime evidence, cryptographically signed, and validated through deterministic replay. Evaluation uses approximately 10,000 component entries across synthetic Agentic AI workloads (50-5,000 components), incorporating OSV, GitHub Advisory, KEV, and EPSS datasets. Under controlled experimental conditions, the framework achieves an F1-score of 0.93 (precision 0.96, recall 0.92), reduces false positives by up to 42% relative to static SBOM-CVE matching without runtime validation, and alters exploitability outcomes in 31% of AI-specific artefact cases through AIBOM extension. Advisory artefacts remain reproducible under deterministic replay. Binding AIBOM artefacts to runtime telemetry transforms CSAF-VEX generation from static disclosure into execution-grounded exploitability assessment for Agentic AI supply chains.
Spatial transcriptomics and proteomics map tissue architecture and cellular interactions, but analysis remains limited by programming demands and text-centered AI agents that lack viewer grounding and cross-turn context. We present spatiAlytica, a viewer-centric multimodal interactive agentic system embedded in the Napari viewer that enables non-programmer biologists to perform iterative, hypothesis-driven spatial omics analysis via natural language. spatiAlytica couples viewer-state serialization, agentic memory, biological concept-to-data-field mapping, code generation and debugging, Spatial VQA, and grounded interpretation to support an exploratory analysis and interpretive reasoning workflow. We introduce spatiAlyticaBench, a comprehensive benchmark spanning 222 single-turn spatial analytical coding questions, 178 multi-turn sequential workflow questions, and 7,350 image-grounded reasoning questions. spatiAlytica outperformed strong agentic baselines while using less time and fewer tokens. Case studies across Kaposi's sarcoma (KS), colorectal cancer, and ovarian cancer recapitulated known spatial patterns and uncovered progressive CD8 T-cell dysfunction during KS progression.
Agentic tools (software environments where a large language model plans, calls external tools, executes code, and iterates with minimal human intervention) will run a substantial share of routine biomedical data analysis within the next few years. However, per-call inference cost on frontier models is the bottleneck and can add up quickly. Here, we tested whether a free, locally-runnable open-weight model could take over the repetitive execution steps at frontier accuracy. We used Claude's Opus to author plans of increasing detail for per-sample variant calling, and ran six 2026-release open-weight implementer LLMs against those plans on a set of desktop GPUs. qwen3.6:27b reproduced frontier accuracy on every plan and matched Opus cell-for-cell on a 36-cell error-injection matrix. A sub-$2,000 Jetson or Apple Mac Mini sufficed for the implementer side. The open-weight model landscape evolves on the order of months, so the specific implementer recommended here will be superseded; we provide the plans, harness, scoring code, and per-cell artifacts at https://github.com/nekrut/LLM-eval-paper as a framework for re-evaluating future models.
Nanopore sequencing has enabled access to various layers of information about DNA and RNA sequence isoforms and chemical modifications. Yet the archipelago of disjoint nanopore analysis tools makes navigating among them a significant challenge for the nanopore user. We present NanoCortex, a unified autonomous agentic framework designed to bridge this gap by providing end-to-end data processing that ranges from raw signal basecalling to biological interpretation. Built upon Gemini API services that incur usage-based API costs and orchestrated through the Gemini Agent Development Kit (ADK), the system utilizes a multi-agent architecture to autonomously perform task parsing, code generation, iterative code-level self-correction, and scientific interpretation. Following code generation, the code can be used offline. Benchmarking reveals that NanoCortex achieves significantly higher usability across complex analytical tasks compared to general-purpose large language models. The framework seamlessly integrates experimental data with meta-analysis of publicly available biological databases to facilitate the extraction of biologically meaningful insights from sequencing data without cumbersome computational steps.
Background Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence (AI) offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. This study aimed to evaluate physician-perceived time efficiency, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methodology In this prospective, single-arm, pilot feasibility study, 29 physicians and medical students across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations (time saving and decision support) and a final Net Promoter Score (NPS). Non-parametric methods were used throughout, with bootstrap confidence intervals (CIs) and sensitivity analysis to address non-response. Results Physicians reported high perceived time saving (mean = 4.27/5; 95% CI = 3.97-4.57) and decision support (mean = 4.16/5; 95% CI = 3.86-4.45), with ratings stable across the five-day study window. Among the 16 (55%) participants who completed the final evaluation, the NPS was 81.2, with no detractors; sensitivity analysis indicated an NPS of 44.8 under conservative non-response assumptions. Conclusions Physicians across specialties and career stages reported positive perceptions of DR. INFO for both time efficiency and clinical decision support within the study window. These findings are preliminary and should be confirmed in larger, controlled studies that include objective performance measures and independent accuracy verification.
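The two Net Promoter Score figures in the abstract above can be related by a short worked calculation. This is only a sketch under stated assumptions: the abstract does not report the promoter/passive split, so the counts below (13 promoters, 3 passives, 0 detractors among the 16 completers) are an assumption chosen to be consistent with the reported value of about 81.2 and "no detractors". Counting the 13 non-responders as passives is likewise an assumed reading of "conservative non-response assumptions"; it happens to reproduce a value near 44.8.

```python
def net_promoter_score(promoters: int, passives: int, detractors: int) -> float:
    """Standard NPS: percentage of promoters minus percentage of detractors."""
    total = promoters + passives + detractors
    if total == 0:
        raise ValueError("NPS is undefined with no respondents")
    return 100.0 * (promoters - detractors) / total


# ASSUMED split among the 16 completers (not reported in the abstract).
PROMOTERS, PASSIVES, DETRACTORS = 13, 3, 0
ENROLLED = 29
NON_RESPONDERS = ENROLLED - (PROMOTERS + PASSIVES + DETRACTORS)  # 13

# NPS among completers only: 100 * 13 / 16 = 81.25
completers = net_promoter_score(PROMOTERS, PASSIVES, DETRACTORS)

# ASSUMED conservative scenario: every non-responder counted as a passive,
# so the denominator grows to all 29 enrolled participants: 100 * 13 / 29.
conservative = net_promoter_score(PROMOTERS, PASSIVES + NON_RESPONDERS, DETRACTORS)

print(f"completers only: {completers:.2f}")
print(f"non-responders as passives: {conservative:.2f}")
```

Other imputation rules (for example, counting non-responders as detractors) would give a much lower score, so the choice of rule matters as much as the headline number.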
We present an agentic workflow that converts heterogeneous safety evidence into concise, reproducible drug summaries. While automated FAERS summarization, retrieval-augmented generation, and tool-driven agents exist in isolation, our contribution lies in their integration within a schema-aware, deterministic pipeline with explicit versioning and pharmacokinetic contextualization. The system queries FAERS via OpenFDA, integrates curated cytochrome P450 mappings, and can retrieve recent PubMed records. It normalizes fields, computes predefined aggregates, assesses enzyme overlap between index drugs and frequent co-medications, and generates constrained narratives and figures directly from computed tables. Applied to 110 drugs, the workflow recovered clear cross-drug patterns in severe outcomes and identified per-drug leaders for death and hospitalization. Case examples for clopidogrel and voriconazole illustrate how co-reporting patterns combined with CYP context provide mechanistic framing without implying causality. Deterministic execution, versioned queries, and cached responses enable exact reruns and audit. The workflow produces structured safety briefs that support safety committee review, early signal triage, and the selection of targets for confirmatory pharmacoepidemiologic studies.
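The enzyme-overlap step described in the abstract above can be pictured as a set intersection between the enzymes mapped to an index drug and those mapped to each co-medication. The sketch below is a minimal illustration, not the authors' pipeline: the `CYP_MAP` table, the function name `enzyme_overlap`, and the list of co-medications are all hypothetical placeholders standing in for the curated mappings and FAERS-derived co-reporting data the paper refers to.

```python
from typing import Dict, List, Set

# HYPOTHETICAL, simplified drug-to-enzyme table for illustration only.
# It is not the curated cytochrome P450 mapping used in the paper.
CYP_MAP: Dict[str, Set[str]] = {
    "clopidogrel": {"CYP2C19", "CYP3A4"},
    "omeprazole": {"CYP2C19", "CYP3A4"},
    "atorvastatin": {"CYP3A4"},
    "metformin": set(),
}


def enzyme_overlap(
    index_drug: str, co_meds: List[str], cyp_map: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Return, per co-medication, the enzymes it shares with the index drug.

    Co-medications with no shared enzyme (or no mapping) are omitted.
    """
    index_enzymes = cyp_map.get(index_drug, set())
    overlaps: Dict[str, List[str]] = {}
    for co_med in co_meds:
        shared = index_enzymes & cyp_map.get(co_med, set())
        if shared:
            overlaps[co_med] = sorted(shared)  # sorted for deterministic output
    return overlaps


# Assumed list of frequent co-medications for the index drug.
overlap = enzyme_overlap(
    "clopidogrel", ["omeprazole", "atorvastatin", "metformin"], CYP_MAP
)
print(overlap)
```

Sorting the shared enzymes keeps the output stable across runs, in keeping with the deterministic execution the abstract emphasizes. As the abstract notes, an overlap of this kind offers mechanistic framing only and does not imply causality.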
We present a multi-agentic workflow for critical materials recovery that deploys a series of AI agents and automated instruments to recover critical materials from produced water and magnet leachates. This approach achieves selective precipitation from real-world feedstocks using simple chemicals, accelerating the optimization of efficient, adaptable, and scalable separations to a timeline of days, rather than months and years.
Variant interpretation in rare diseases requires navigating multiple genomic databases, each with strict input formats, while synthesizing heterogeneous evidence. This process creates significant barriers for non-experts and imposes a substantial cognitive burden on experienced specialists. These challenges are evident in tools such as model organism aggregated resources for rare variant exploration (MARRVEL), which require precise variant formatting (e.g., Human Genome Variation Society [HGVS] notation) and return complex, heterogeneous outputs. To address these usability barriers, we developed MARRVEL-MCP, a natural-language interface that enables large language models (LLMs) to perform end-to-end variant interpretation via structured tool access. This work demonstrates the impact of tool-augmented context engineering, the purposeful design of domain-aware tool environments and structured information scaffolding through executable function interfaces, on reshaping the role of model scale in genomics. MARRVEL-MCP equips LLMs with 44 tools spanning gene and variant utilities, pathogenicity databases, phenotype resources, expression atlases, ortholog data, and literature APIs. Without hard-coded workflows, LLMs infer which tools to invoke and in what sequence, performing named-entity recognition, identifier normalization, and multi-database synthesis from clinical queries. Using 100 expert-curated questions, lightweight models (3B-20B parameters) with MARRVEL-MCP matched or outperformed larger models without tool access. A 20B-parameter model (gpt-oss-20b) achieved a 94% pass rate, versus 41% without MARRVEL-MCP, approaching state-of-the-art proprietary performance. Although expert oversight remains essential and tool use adds cost, these results show that contextual guidance can compensate for limited model capacity. 
These findings establish context engineering as a core principle for biomedical AI and support scalable integration of LLMs with curated genomic resources.
Successfully adapting to life in the highest altitudes ("Roof of the world") is a heritage of evolutionary adaptation for humans. With rising interest in adventure travel and expanding transport networks that facilitate mobility from low to high altitudes, provision of healthcare for populations living in high-altitude regions has re-emerged as an area of interest and research. These populations have several unique characteristics that limit the simple generalization of medical knowledge. First, these populations are naturally segregated into distinct ethnic groups, representing a unique marginal demographic. Second, the harsh natural environment, underdeveloped healthcare infrastructure, and limited research into and understanding of the healthcare needs, issues, and challenges experienced by highland communities pose significant barriers to equitable healthcare access. Medical artificial intelligence and digital technology offer an opportunity to provide innovative solutions for these populations. However, in their current state these technologies would not facilitate health equity, as most are narrow in their application, are not trained on data representative of these regions, and ignore the multifactorial nature of health, which combines biological and physiological factors with environmental and social factors. The success of generalist models for tasks such as scientific discovery provides a mechanism to leapfrog existing challenges and provide equitable care in these regions. In this paper, we discuss the opportunity of intelligent medical agents developed on generalist foundation models to meet the unique needs of high-altitude populations.
Contemporary clinical practice still produces unstructured data such as free-text reports or scans, hindering automated interpretation by knowledge-based clinical decision support (CDS) systems that rely on structured data. Large language models (LLMs) show potential for interpreting such findings but face challenges in accuracy, infrastructure demands, and data privacy. Integrating LLMs with modular knowledge-based CDS systems could provide validated interpretations of such findings, but models need to call CDS modules with perfectly accurate parameters. Using a novel framework, we tested the accuracy of multiple size classes of LLMs in calling Arden Syntax Medical Logic Modules for hepatitis serology interpretation of varying complexity from unstructured multi-modal inputs. Computationally lean LLMs like GPT-OSS were found to handle a small number of low-complexity parameters with high accuracy, approaching clinical feasibility for private and reliable CDS interpretation of multi-modal data. Accuracy decreased sharply for tools involving more numerous or complex quantitative parameters.
Egocentric videos are inherently long-form, as they provide a continuous, first-person perspective of daily life, capturing complex social interactions and routines that naturally span days or weeks. Understanding and reasoning over egocentric videos that span hours or even days poses significant challenges due to their length, multimodal nature, and complex temporal dependencies over long time horizons. To this end, we introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., days and weeks) egocentric videos. Ego-R1 leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, empowering the agent to act as a high-level controller that dynamically invokes specialized tools, such as hierarchical memory retrievers and multimodal perceptors, to iteratively and collaboratively answer sub-questions. This approach enables effective temporal abstraction, long-horizon dependency tracking, and step-by-step multimodal reasoning. The framework is built upon a flexible toolkit designed for efficient temporal retrieval and granular visual analysis: Hierarchical RAG (H-RAG), a text-based module that performs efficient top-down temporal localization by aggregating video logs from day-level summaries down to 10-minute intervals; Video-LLM, a short-horizon perception module that analyzes local temporal windows to interpret dynamic interactions; and VLM, a fine-grained vision-language model used to extract high-resolution details, such as text or object attributes, from specific frames. We design a two-stage training paradigm: supervised fine-tuning (SFT) of a pretrained language model on CoTT data, to enable dynamic tool proposal for long-range reasoning, followed by RL, to improve the agent's ability to plan effectively with tools.
To facilitate training, we construct Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, we evaluate Ego-R1 on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains hybrid-source, human-verified QA pairs. Extensive experiments show that our 3B-parameter Ego-R1 Agent achieves the strongest performance among open-weight and tool-agent baselines, while offering interpretable tool-grounded reasoning trajectories. On Ego-R1 Bench, Ego-R1 achieves 46.0% accuracy, substantially outperforming Gemini-1.5-Pro (38.3%) and LLaVA-Video (29.0%); we further report Gemini-3.1-Pro as a stronger closed-source reference at 53.7%. Moreover, the framework exhibits strong generalization to standard exocentric video benchmarks; by leveraging the long-video nature of egocentric data to train the orchestrator's planning capabilities rather than overfitting the perceptors to a specific view, our modular design remains robust across domains. Ego-R1 Agent achieves 64.9% accuracy on Video-MME (long), surpassing leading open-weight models. These results validate that dynamic, tool-augmented reasoning effectively bridges the gap between limited context windows and the demands of understanding both week-long first-person experiences and general long-form video content.
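The top-down temporal localization attributed to H-RAG in the abstract above (descending from day-level summaries to 10-minute intervals) can be sketched as a greedy walk down a tree of text summaries. This is a toy illustration under stated assumptions, not the Ego-R1 implementation: the `Node` class, the `localize` function, the word-overlap `score`, and the example log entries are all hypothetical, and a real system would presumably use a learned retriever rather than keyword matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One time span (week, day, or 10-minute interval) with a text summary."""
    label: str
    summary: str
    children: List[Node] = field(default_factory=list)


def score(query: str, summary: str) -> int:
    """ASSUMED stand-in relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(summary.lower().split()))


def localize(root: Node, query: str) -> List[str]:
    """Greedy top-down search: at each level, descend into the best-scoring child."""
    path = [root.label]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child.summary))
        path.append(node.label)
    return path


# Hypothetical two-level log: day-level summaries over 10-minute intervals.
week = Node("week", "", [
    Node("day 1", "cooked pasta and watched a movie", [
        Node("day 1 18:00-18:10", "boiled water for pasta"),
        Node("day 1 20:00-20:10", "watched a movie on the sofa"),
    ]),
    Node("day 2", "went shopping and bought milk at the store", [
        Node("day 2 10:00-10:10", "walked to the store"),
        Node("day 2 10:10-10:20", "bought milk and paid at the checkout"),
    ]),
])

path = localize(week, "when was the milk bought")
print(path)
```

The point of the hierarchy is that only one branch is read at each level, so the amount of text inspected grows with the depth of the tree rather than with the total length of the video log.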
Bullying, a significant global issue detrimental to student well-being, is increasingly understood as a goal-directed strategy within power-imbalanced contexts. This study investigates the relationships among agentic goals, two resource control strategies (coercive and prosocial), and bullying behaviors. A sample of 1,000 Chinese adolescents (mean age = 13.6 years) completed measures of agentic goals, resource control strategies, and bullying behavior. Adopting a person-oriented approach, we first used latent profile analysis (LPA) on prosocial strategy scores to identify heterogeneous subgroups. Subsequently, variable-oriented moderated mediation models were examined within each subgroup. LPA delineated two distinct subgroups: a High Prosocial Orientation Group (73.8%) and a Low Prosocial Orientation Group (26.2%). Across the sample, agentic goals were positively associated with bullying, mediated by coercive strategies. The critical finding was that prosocial strategies moderated this mediation pathway; however, this moderated mediation effect was significant only within the High Prosocial Orientation Group. This study supports a nonpathological, goal-oriented framework for understanding bullying. The findings reveal that the protective role of prosocial strategies is conditional, effectively moderating the harmful pathway from agentic goals to bullying only among adolescents who already possess a high baseline level of such competence. This underscores the importance of interventions that address underlying motivational goals and promote prosocial skills, while also highlighting the potential need for differentiated approaches based on individuals' existing strategic repertoires.
Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when testing targeted decision-making: for instance, how models handle exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here, we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought (CoT) reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and CoT prompting provides only slight improvements, supervised fine-tuning, specifically with human explanations, yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs' shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.