Search results: Harmful algae

20 results found

LLMs Encode Harmfulness and Refusal Separately

arXiv

LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an i

Why Do Large Language Models Generate Harmful Content?

arXiv, 2026-04-13. Authors: Rajesh Ganguli, Raha Moraffah

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.

LLMs Encode Harmfulness and Refusal Separately

Why Do Large Language Models Generate Harmful Content?

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images

Structural Dynamics of Harmful Content Dissemination on WhatsApp

The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Harmful Random Utility Models

Improving Hydrodynamic Modeling of Free-Swimming Algae Using a Modified Three-Sphere Approach

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models

Harmful Traits of AI Companions

HOD: A Benchmark Dataset for Harmful Object Detection

Making Harmful Behaviors Unlearnable for Large Language Models

Towards Comprehensive Detection of Chinese Harmful Memes

Harmful Suicide Content Detection

Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLMs as Alternative Annotators

Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation

Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models