Search - ResearchTracker

Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.

Towards More Realistic Extraction Attacks: An Adversarial Perspective

arXiv, 2024-07-02. Authors: Yash More, Prakhar Ganesh, Golnoosh Farnadi

Language models are prone to memorizing their training data, making them vulnerable to extraction attacks. While existing research often examines isolated setups, such as a single model or a fixed prompt, real-world adversaries have a considerably larger attack surface due to access to models across various sizes and checkpoints, and repeated prompting. In this paper, we revisit extraction attacks from an adversarial perspective -- with multi-faceted access to the underlying data. We find significant churn in extraction trends, i.e., even unintuitive changes to the prompt, or targeting smaller models and earlier checkpoints, can extract distinct information. By combining multiple attacks, our adversary doubles ($2 \times$) the extraction risks, persisting even under mitigation strategies like data deduplication. We conclude with four case studies, including detecting pre-training data, copyright violations, extracting personally identifiable information, and attacking closed-source models, showing how our more realistic adversary can outperform existing adversaries in the literature.

Search results: more

PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

Towards More Realistic Extraction Attacks: An Adversarial Perspective

Optimal cross-correlation technique to search for strongly lensed gravitational waves

ROScopter: A Multirotor Autopilot based on ROSflight 2.0

Semi-Supervised Learning under General Causal Models

Are Large Language Models Consistent over Value-laden Questions?

Optimal Foraging in Memory Retrieval: Evaluating Random Walks and Metropolis-Hastings Sampling in Modern Semantic Spaces

Water Evolution & Inventories of Super-Earths Orbiting Late M Dwarfs

A Priori Error Bounds for the Approximate Deconvolution Leray Reduced Order Model

On the cosmology and terrestrial signals of sexaquark dark matter

Chain of Thought Still Thinks Fast: APriCoT Helps with Thinking Slow

Sum rules for the Gravitational Form Factors using light-front dressed quark state

Higher Order Wiener-Wintner systems: examples and applications

Quantum Walks on the Hypercube

Language Models Understand Us, Poorly

Approximate Representations and Approximate Homomorphisms

The Formation of Quasars in Low Luminosity Hosts via Galaxy Harassment

Resolving the Structure of Cold Dark Matter Halos

Series Expansion of the Percolation Threshold on Hypercubic Lattices

The origin and tidal evolution of cuspy triaxial haloes
