Search — ResearchTracker

Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, particularly those with stronger Arabic capabilities. This effect is consistently reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination levels grow. To address this blind spot, we propose Translation-Aware Contamination Detection,

Quantifying the Effect of Test Set Contamination on Generative Evaluations

arXiv, 2026-01-07. Authors: Rylan Schaeffer, Joshua Kazdan, Baber Abbasi

As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the am

Search results: contamination

Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora

Quantifying the Effect of Test Set Contamination on Generative Evaluations

On The Fragility of Benchmark Contamination Detection in Reasoning Models

Subsample-Based Estimation under Dynamic Contamination

No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models

The Impact of Fiber Cross Contamination on Radial Velocity Precision

Investigating Data Contamination for Pre-training Language Models

DCR: Quantifying Data Contamination in LLMs Evaluation

Rethinking the effects of data contamination in Code Intelligence

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

The Impact of Post-training on Data Contamination

Impact of Inaccurate Contamination Ratio on Robust Unsupervised Anomaly Detection

Confidence Intervals for Linear Models with Arbitrary Noise Contamination

Detecting Benchmark Contamination Through Watermarking

CAP: Data Contamination Detection via Consistency Amplification

ConStat: Performance-Based Contamination Detection in Large Language Models

Towards Contamination Resistant Benchmarks

A Taxonomy for Data Contamination in Large Language Models

When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

Data Contamination Report from the 2024 CONDA Shared Task