Search - ResearchTracker

Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This leaves vulnerabilities in non-English contexts, especially in low-resource languages. We introduce a novel application of knowledge distillation (KD) to multilingual jailbreak prevention and examine its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box, response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced "boundary" refusals, we mitigate or even reverse sa

Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

arXiv · 2024-10-01 · Authors: Chiao-An Yang, Ziwei Liu, Raymond A. Yeh

Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher-level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve a model's predictions. To validate our hypothesis, we propose a search-and-aggregate method that finds useful activation maps to use at test time. We apply our approach to image classification and semantic segmentation. Extensive experiments over nine different architectures on multiple datasets show that our method consistently improves model test-time performance, complementing existing test-time augmentation techniques. Our code is available at https://github.com/ca-joe-yang/discard-in-subsampling.
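To make the notion of "discarded activations" concrete, here is a minimal NumPy sketch. It is not the authors' search-and-aggregate method; it only illustrates, under the assumption of a plain stride-2 subsampling, that a single subsampled map keeps one quarter of the activations and that the other sampling offsets yield alternative maps built from the discarded ones. The helper names `subsample` and `all_shifted_views` are hypothetical.

```python
import numpy as np


def subsample(x, stride=2, offset=(0, 0)):
    """Keep every `stride`-th activation, starting at `offset` (row, col)."""
    r, c = offset
    return x[r::stride, c::stride]


def all_shifted_views(x, stride=2):
    """Return each subsampled map obtainable by shifting the sampling grid.

    Hypothetical helper: offset (0, 0) is what a standard strided layer
    keeps; the other offsets are made of activations it would discard.
    """
    return {
        (r, c): subsample(x, stride, (r, c))
        for r in range(stride)
        for c in range(stride)
    }


# Toy 4x4 activation map with values 0..15.
activation = np.arange(16, dtype=np.float32).reshape(4, 4)

views = all_shifted_views(activation)
kept = views[(0, 0)]
discarded_fraction = 1 - kept.size / activation.size

print(kept)
print(discarded_fraction)
```

With stride 2 there are four such views, and together they cover every entry of the original map exactly once. A test-time method could, in principle, choose among or combine them; how the paper actually does so is not specified in the abstract.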

Search results: unwittingly

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time

Turn Your Face Into An Attack Surface: Screen Attack Using Facial Reflections in Video Conferencing

Learning Normal Representations for Blood Biomarkers

Disclosure By Design: Identity Transparency as a Behavioural Property of Conversational AI Models

AI Safeguards, Generative AI and the Pandora Box: AI Safety Measures to Protect Businesses and Personal Reputation

Engineering Robustness into Personal Agents with the AI Workflow Store

Good Vibes! Towards Phone-to-User Authentication Through Wristwatch Vibrations

Propagation Dynamics of Rumor vs. Non-rumor across Multiple Social Media Platforms Driven by User Characteristics

Anticipatory Task and Motion Planning

Why Transaction Cost Economics Failed and How to Fix It

Digital logic from high-efficiency superconducting diodes

The Concerning S$H_0$ES Hubble Constant

Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization

Refusing Safe Prompts for Multi-modal Large Language Models

Mitigating Calibration Bias Without Fixed Attribute Grouping for Improved Fairness in Medical Imaging Analysis

Towards Democratizing Joint-Embedding Self-Supervised Learning

Towards Reliable Dermatology Evaluation Benchmarks

Anticipatory Planning: Improving Long-Lived Planning by Estimating Expected Cost of Future Tasks

Side Eye: Characterizing the Limits of POV Acoustic Eavesdropping from Smartphone Cameras with Rolling Shutters and Movable Lenses