Search - ResearchTracker

Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard conf
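The four-quadrant taxonomy described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `classify_sample` and `quadrant_shares` are hypothetical, and the choice to test image reliance by comparing the first paraphrase's prediction against a single image-removed prediction is an assumption, since the abstract does not say how the comparison is made.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def classify_sample(paraphrase_preds: Sequence[str], no_image_pred: str) -> str:
    """Assign one sample to a quadrant (hypothetical helper).

    paraphrase_preds: predictions for one image under paraphrased prompts.
    no_image_pred: prediction for the same question with the image removed.
    """
    if not paraphrase_preds:
        raise ValueError("need at least one paraphrase prediction")

    # Consistent: every paraphrase yields the same prediction.
    consistent = len(set(paraphrase_preds)) == 1

    # Assumption: the sample is image-reliant if removing the image changes
    # the prediction obtained with the first prompt.
    image_reliant = paraphrase_preds[0] != no_image_pred

    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"
    if consistent:
        return "Dangerous"
    return "Worst"


def quadrant_shares(
    samples: Sequence[Tuple[Sequence[str], str]]
) -> Dict[str, float]:
    """Fraction of samples falling in each quadrant."""
    counts = Counter(classify_sample(preds, blank) for preds, blank in samples)
    total = len(samples)
    return {
        name: counts.get(name, 0) / total
        for name in ("Ideal", "Fragile", "Dangerous", "Worst")
    }


if __name__ == "__main__":
    toy = [
        (["yes", "yes", "yes"], "no"),   # stable, changes without the image
        (["yes", "no", "yes"], "no"),    # unstable, changes without the image
        (["yes", "yes"], "yes"),         # stable, same answer without the image
        (["yes", "no"], "yes"),          # unstable, same answer without the image
    ]
    print(quadrant_shares(toy))
```

The toy inputs are made up for illustration. The point of the sketch is that a sample can be perfectly consistent across paraphrases and still land in the Dangerous quadrant, because consistency alone says nothing about whether the image was used.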

Expanding External Access To Frontier AI Models For Dangerous Capability Evaluations

arXiv, 2026-01-17. Authors: Jacob Charnock, Alejandro Tlaie, Kyle O'Brien

Frontier AI companies increasingly rely on external evaluations to assess risks from dangerous capabilities before deployment. However, external evaluators often receive limited model access, limited information, and little time, which can reduce evaluation rigour and confidence. The EU General-Purpose AI Code of Practice calls for "appropriate access", but does not specify what this means in practice. Furthermore, there is no common framework for describing different types and levels of evaluator access. To address this gap, we propose a taxonomy of access methods for dangerous capability evaluations. We disentangle three aspects of access: model access, model information, and evaluation timeframe. For each aspect, we review benefits and risks, including how expanding access can reduce false negatives and improve stakeholder trust, but can also increase security and capacity challenges. We argue that these limitations can likely be mitigated through technical means and safeguards used in other industries. Based on the taxonomy, we propose three descriptive access levels: AL1 (black-box model access and minimal information), AL2 (grey-box model access and substantial information),

Search results: Dangerous

Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

Expanding External Access To Frontier AI Models For Dangerous Capability Evaluations

Technical Requirements for Halting Dangerous AI Activities

mmDrive: mmWave Sensing for Live Monitoring and On-Device Inference of Dangerous Driving

Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations

Cybersecurity AI: The Dangerous Gap Between Automation and Autonomy

Evaluating Frontier Models for Dangerous Capabilities

Some information is too dangerous to be on the internet

Guessing human intentions to avoid dangerous situations in caregiving robots

Analytical assessment of workers' safety concerning direct and indirect ways of getting infected by dangerous pathogen

Kiki Kills: Identifying Dangerous Challenge Videos from Social Media

A Multimodal Dangerous State Recognition and Early Warning System for Elderly with Intermittent Dementia

Research on Dangerous Flight Weather Prediction based on Machine Learning

Insights Into Incitement: A Computational Perspective on Dangerous Speech on Twitter in India

Towards a Computational Analysis of Suspense: Detecting Dangerous Situations

Student Dangerous Behavior Detection in School

Understanding and Detecting Dangerous Speech in Social Media

Detection of Dangerous Events on Social Media: A Perspective Review

Generation of Threat: Crediting football players for creating dangerous actions in an unbiased way

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models