搜索 — ResearchTracker

Large language models (LLMs) have evolved into autonomous agents that rely on open skill ecosystems (e.g., ClawHub and Skills.Rest), hosting numerous publicly reusable skills. Existing security research on these ecosystems mainly focuses on vulnerabilities within skills, such as prompt injection. However, there is a critical gap regarding skills that may be misused for harmful actions (e.g., cyber attacks, fraud and scams, privacy violations, and sexual content generation), namely harmful skills. In this paper, we present the first large-scale measurement study of harmful skills in agent ecosystems, covering 98,440 skills across two major registries. Using an LLM-driven scoring system grounded in our harmful skill taxonomy, we find that 4.93% of skills (4,858) are harmful, with ClawHub exhibiting an 8.84% harmful rate compared to 3.49% on Skills.Rest. We then construct HarmfulSkillBench, the first benchmark for evaluating agent safety against harmful skills in realistic agent contexts, comprising 200 harmful skills across 20 categories and four evaluation conditions. By evaluating six LLMs on HarmfulSkillBench, we find that presenting a harmful task through a pre-installed skill su

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

arXiv2026-04-20作者：Md Rysul Kabir, Zoran Tiganj

Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behaviora

搜索结果：Harmful

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

LLMs Encode Harmfulness and Refusal Separately

Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes

Why Do Large Language Models Generate Harmful Content?

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

Defending Against Harmful Supervision Hidden in Benign Samples

MemeMind: A Large-Scale Multimodal Dataset with Chain-of-Thought Reasoning for Harmful Meme Detection

The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Automated Harmfulness Testing for Code Large Language Models

SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation

Structural Dynamics of Harmful Content Dissemination on WhatsApp