Found 20 results
A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
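The capped-performance idea can be illustrated with a simple statistical check. The sketch below is not the CapCode implementation: it assumes, for illustration only, that the cap is a per-test pass probability, that the randomized tests are independent, and that a score is flagged when its binomial upper-tail probability under the cap falls below a hypothetical threshold `alpha`.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exceeds_cap(passed: int, total: int, cap: float, alpha: float = 0.01) -> bool:
    """Flag a score as implausible for an honest solver.

    Assumption (not from the paper): an honest solver passes each
    randomized test independently with probability at most `cap`.
    """
    return binomial_upper_tail(passed, total, cap) < alpha


if __name__ == "__main__":
    # With a hypothetical cap of 0.8, a perfect score on 100 tests is flagged,
    # while a score sitting at the cap is not.
    print(exceeds_cap(100, 100, cap=0.8))
    print(exceeds_cap(80, 100, cap=0.8))
```

Under these assumptions, a score near the cap is unremarkable, while a score far above it is strong evidence that the tests were passed by some route other than solving the task.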
Contest participants often have strong incentives to engage in cheating. Sanctions serve as a common deterrent against such conduct. Often, other agents on the contestant's team (e.g., a coach of an athlete) or a company (a manager of an R\&D engineer) have a vested interest in outcomes and can influence the cheating decision. An agency problem arises when only the contestant faces the penalties for cheating. Our theoretical framework examines joint liability, i.e., shifting some responsibility from the contestant to the other agent, as a solution to this agency problem. Equilibrium analysis shows that extending liability reduces cheating if fines are harsh. Less intuitively, when fines are lenient, a shift in liability can lead to an increase in equilibrium cheating rates. Experimental tests confirm that joint liability is effective in reducing cheating if fines are high. However, the predicted detrimental effect of joint liability for low fines does not occur.
We investigate a cheating robot version of Cops and Robber, first introduced by Huggan and Nowakowski, where both the cops and the robber move simultaneously, but the robber is allowed to react to the cops' moves. For conciseness, we refer to this game as Cops and Cheating Robot. The cheating robot number for a graph is the minimum number of cops needed to win on the graph. We introduce a new parameter for this variation, called the push number, which gives the minimum number of cops that move onto the robber's vertex given that the number of cops on the graph equals the cheating robot number. After producing some elementary results on the push number, we use it to give a relationship between Cops and Cheating Robot and Surrounding Cops and Robbers. We investigate the cheating robot number for planar graphs and give a tight bound for bipartite planar graphs. We show that determining whether a graph has cheating robot number at most $k$, for fixed $k$, can be done in polynomial time. We also obtain bounds on the cheating robot number for strong and lexicographic products of graphs.
Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13\% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and
Cheating in chess, by using advice from powerful software, has become a major problem, reaching the highest levels. As opposed to the large majority of previous work, which concerned {\em detection} of cheating, here we try to evaluate the possible gain in performance, obtained by cheating a limited number of times during a game. We develop threshold-based and Bellman-style intervention policies, and test them in a controlled engine-vs-engine setting using Stockfish. A judicious choice of 1 or 2 cheats yields average scores of 0.71 and 0.82, respectively, compared to 0.51 with no cheats. We also introduce a fast, engine-free simulator that enables hyperparameter optimization without running games, closely matching the engine-based optimum. The goal of this work is not to assist cheaters, but to measure the effectiveness of cheating -- which is crucial as part of the effort to contain and detect it.
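A threshold-based intervention policy of the kind mentioned above can be sketched as follows. The abstract does not specify the policy's details, so everything here is a hypothetical illustration: `gains` stands for the estimated advantage of the engine's move over the player's own move at each turn, `budget` is the number of allowed cheats, and `threshold` is the minimum gain that triggers a cheat.

```python
from typing import List, Sequence


def threshold_policy(gains: Sequence[float], budget: int, threshold: float) -> List[int]:
    """Return the move indices at which a cheat is used.

    Hypothetical rule: scan moves in order and spend one cheat whenever
    the estimated gain reaches the threshold, until the budget runs out.
    """
    chosen: List[int] = []
    for move_index, gain in enumerate(gains):
        if len(chosen) >= budget:
            break  # budget exhausted, no further interventions
        if gain >= threshold:
            chosen.append(move_index)
    return chosen


if __name__ == "__main__":
    # Five moves, two cheats allowed, cheat only when the gain is at least 0.8.
    print(threshold_policy([0.1, 0.9, 0.3, 1.2, 2.0], budget=2, threshold=0.8))
```

A greedy rule like this can spend its budget early and miss a larger gain later in the game, which is one plausible motivation for comparing it against a Bellman-style policy that accounts for the remaining moves.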
Cheating poses a significant threat to the Multiplayer Online Games (MOG) industry by degrading player satisfaction and undermining the fairness in competitive gaming. Despite efforts to develop mitigation techniques, cheating remains difficult to detect and prevent in practice. In particular, a class of cheats based on network flow disruption remains unsolvable. To find out how to detect such attacks we need access to representative labelled data. However, no such dataset exists. To address this gap, we leverage an experimental framework that combines a multiplayer online game with a plug-in capable of both reproducing cheating attacks and collecting logs at two levels: network and application-layer. This paper presents a dataset compiling records of game sessions played by both real players and automated game clients, with cheating actions explicitly logged. To the best of our knowledge, this is the first dataset that provides logs of network flow disruption cheats. While it includes such network-based cheats, it is not limited to them and also contains records of more commonly studied cheats, such as aimbots and wallhacks. This dataset can be used by researchers in academia and
Visual aimbots have emerged as a serious cheating threat in first-person shooter (FPS) games, as they evade existing anti-cheat defenses by operating only on rendered frames rather than game memory. However, existing defenses fail to provide an end-to-end solution: post-hoc behavior detectors cannot protect match integrity in real time and are increasingly fragile against human-mimicking aimbots, while proactive runtime defenses often lack accountability, incur substantial overhead, or require intrusive system integration. We present AimTrap, the first end-to-end defense against visual aimbots that combines real-time protection with post-game detection using two adversarial texture mechanisms. Adversarial Camouflage Textures (ACT) hide real players from aimbots, while Adversarial Honeypot Textures (AHT) lure aimbots into locking onto fake targets, yielding strong evidence of cheating. To make this practical, AimTrap integrates differentiable rendering with Expectation over Renderings for robust 3D texture synthesis, secure texture management, and a novel honeypot-interaction trajectory analysis pipeline for accurate cheating attribution. In real-game evaluation against a state-of-t
Background: Cheating in university education is commonly described as context dependent and influenced by assessment design, institutional norms, and student interpretation. In software engineering education, programming-oriented coursework has historically involved ambiguity around collaboration, reuse, and external assistance. Recently, large language models (LLMs) have introduced additional mediation in the production of code and related artifacts. Aims: This study investigates how software engineering students describe experiences of using LLMs in ways they perceived as inappropriate, disallowed, or misaligned with course expectations. Method: A cross-sectional survey was conducted with 116 undergraduate software engineering students from multiple countries, combining quantitative summaries with qualitative data. Results: Reported LLM cheating practices occurred primarily in programming assignments, routine coursework, and documentation tasks, often in contexts of time pressure and unclear guidance. Use during quizzes and exams was less frequent and more consistently identified as a violation. Students reported awareness of academic and professional consequences regarding LLM c
Remote proctoring technology, a cheating-preventive measure, often raises privacy and fairness concerns that may affect test-takers' experiences and the validity of test results. Our study explores how selectively obfuscating information in video recordings can protect test-takers' privacy while ensuring effective and fair cheating detection. Interviews with experts (N=9) identified four key video regions indicative of potential cheating behaviors: the test-taker's face, body, background and the presence of individuals in the background. Experts recommended specific obfuscation methods for each region based on privacy significance and cheating behavior frequency, ranging from conventional blurring to advanced methods like replacement with deepfake, 3D avatars and silhouetting. We then conducted a vignette experiment with potential test-takers (N=259, non-experts) to evaluate their perceptions of cheating detection, visual privacy and fairness, using descriptions and examples of still images for each expert-recommended combination of video regions and obfuscation methods. Our results indicate that the effectiveness of obfuscation methods varies by region. Tailoring remote proctoring
The online game "Battlefield" is well known for its large-scale multiplayer capabilities and distinctive gameplay features, including various vehicle controls. However, these features make the game a major target for cheating, which significantly detracts from the gaming experience. This study uses statistical methods to analyze the behavior of cheating players in "Battlefield". We aim to provide comprehensive insights into cheating players through an extensive analysis of over 44,000 reported cheating incidents collected via the "Game-tools API". Our methodology includes computing basic statistics of key variables, correlation analysis, and visualizations using histograms, box plots, and scatter plots. Our findings emphasize the importance of adaptive, data-driven approaches to preventing cheating in online games.
This report investigates the perceptions of teaching staff on the prevalence of student cheating and the impact of Generative AI on academic integrity. Data was collected via an anonymous survey of teachers at the Department of Information Technology at Uppsala University and analyzed alongside institutional statistics on cheating investigations from 2004 to 2023. The results indicate that while teachers generally do not view cheating as highly prevalent, there is a strong belief that its incidence is increasing, potentially due to the accessibility of Generative AI. Most teachers do not equate AI usage with cheating but acknowledge its widespread use among students. Furthermore, teachers' perceptions align with objective data on cheating trends, highlighting their awareness of the evolving landscape of academic dishonesty.
Cheating in online games poses significant threats to the gaming industry, yet most prior research has concentrated on Massively Multiplayer Online Role-Playing Games (MMORPGs). Competitive genres, such as Multiplayer Online Battle Arena (MOBA), First Person Shooter (FPS), Real Time Strategy (RTS), and Action games, remain underexplored due to the difficulty of detecting cheating users and the demand for complex data and techniques. To address this gap, many game companies rely on kernel-level anti-cheat solutions, which, while effective, raise serious concerns regarding user privacy and system security. In this paper, we propose SYNOPTICON, a novel cheating detection framework that leverages user consensus to identify abnormal behavior. SYNOPTICON integrates a lightweight client-side detection mechanism with a server-side voting system: when suspicious activity is identified, clients cast votes to the server, which aggregates them to establish consensus and distinguish cheaters from legitimate players. This architecture enables transparency, reduces reliance on intrusive monitoring, and mitigates privacy risks. We evaluate SYNOPTICON in both a controlled simulation and a real-world
The spread of the Coronavirus disease-2019 epidemic has caused many courses and exams to be conducted online. The cheating behavior detection model in examination invigilation systems plays a pivotal role in guaranteeing the equality of long-distance examinations. However, cheating behavior is rare, and most researchers do not comprehensively take into account features such as head posture, gaze angle, body posture, and background information in the task of cheating behavior detection. In this paper, we develop and present CHEESE, a CHEating detection framework via multiplE inStancE learning. The framework consists of a label generator that implements weak supervision and a feature encoder to learn discriminative features. In addition, the framework combines body posture and background features extracted by 3D convolution with eye gaze, head posture and facial features captured by OpenFace 2.0. These features are fed into the spatio-temporal graph module by stitching to analyze the spatio-temporal changes in video clips to detect the cheating behaviors. Our experiments on three datasets, UCF-Crime, ShanghaiTech and Online Exam Proctoring (OEP), prove the effectiveness of our method
Oblivious transfer has attracted study because it can be used as a building block for multiparty computation. There are many forms of oblivious transfer; we explore a variant known as Rabin oblivious transfer. Here the sender Alice has one bit, and the receiver Bob obtains this bit with a certain probability. The sender does not know whether the receiver obtained the bit or not. For a previously suggested protocol, we show a possible attack using a delayed measurement. This allows a cheating party to pass tests carried out by the other party, while gaining more information than if they had been honest. We show how this attack allows perfect cheating, unless the protocol is modified, and suggest changes which lower the cheating probability for the examined cheating strategies.
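For readers unfamiliar with the primitive, the ideal behaviour of Rabin oblivious transfer can be sketched classically. This is only a simulation of the intended input/output behaviour, not the protocol analysed in the paper and not a secure implementation. The function name `rabin_ot` and the parameter `p_receive` are illustrative; the default of 1/2 is the standard choice for Rabin oblivious transfer.

```python
import random
from typing import Optional


def rabin_ot(bit: int, p_receive: float = 0.5,
             rng: Optional[random.Random] = None) -> Optional[int]:
    """Simulate the ideal functionality of Rabin oblivious transfer.

    The receiver gets `bit` with probability `p_receive`, otherwise an
    erasure (None). The function returns only the receiver's view, so the
    sender learns nothing about which outcome occurred.
    """
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    rng = rng or random.Random()
    return bit if rng.random() < p_receive else None


if __name__ == "__main__":
    rng = random.Random(0)
    outcomes = [rabin_ot(1, rng=rng) for _ in range(10_000)]
    received = sum(o is not None for o in outcomes)
    print(f"received the bit in {received / len(outcomes):.3f} of runs")
```

A cheating receiver in this setting is one who learns the bit more often than `p_receive` allows while still passing the sender's checks.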
Online exams have become popular in recent years due to their accessibility. However, some concerns have been raised about the security of the online exams, particularly in the context of professional cheating services aiding malicious test takers in passing exams, forming so-called "cheating rings". In this paper, we introduce a human-in-the-loop AI cheating ring detection system designed to detect and deter these cheating rings. We outline the underlying logic of this human-in-the-loop AI system, exploring its design principles tailored to achieve its objectives of detecting cheaters. Moreover, we illustrate the methodologies used to evaluate its performance and fairness, aiming to mitigate the unintended risks associated with the AI system. The design and development of the system adhere to Responsible AI (RAI) standards, ensuring that ethical considerations are integrated throughout the entire development process.
AI-assisted cheating has emerged as a significant threat in the context of online exams. Advanced browser extensions now enable large language models (LLMs) to answer questions presented in online exams within seconds, thereby compromising the security of these assessments. In this study, the behaviors of students (N = 52) on an online exam platform during a proctored, face-to-face exam were analyzed using clustering methods, with the aim of identifying groups of students exhibiting suspicious behavior potentially associated with cheating. Additionally, students in different clusters were compared in terms of their exam scores. Suspicious exam behaviors in this study were defined as selecting text within the question area, right-clicking, and losing focus on the exam page. The total frequency of these behaviors performed by each student during the exam was extracted, and k-Means clustering was employed for the analysis. The findings revealed that students were classified into six clusters based on their suspicious behaviors. It was found that students in four of the six clusters, representing approximately 33% of the total sample, exhibited suspicious behaviors at varying levels. W
This systematic literature review surveys technical defenses against software-based cheating in online multiplayer games, categorizing existing approaches into server-side detection, client-side anti-tamper, kernel-level anti-cheat drivers, and hardware-assisted TEEs. Each category is evaluated in terms of detection effectiveness, performance overhead, privacy impact, and scalability. The analysis highlights key trade-offs, particularly between the high visibility of kernel-level solutions and their privacy and stability risks, versus the low intrusiveness but limited insight of server-side methods. Overall, the review emphasizes the ongoing arms race with cheaters and the need for robust, adversary-resistant anti-cheat designs.
Video game cheats modify a video game's behaviour to give unfair advantages to some players while bypassing the methods game developers use to detect them. This destroys the experience of online gaming and can result in financial losses for game developers. In this work, we present a new type of game cheat, Virtual machine Introspection Cheat (VIC), that takes advantage of virtual machines to stealthily execute game cheats. VIC employs a hypervisor with introspection enabled to lower the bar of cheating against legacy and modern anti-cheat systems. We demonstrate the feasibility and stealthiness of VIC against three popular games (Fortnite, BlackSquad and Team Fortress 2) that include five different anti-cheats. In particular, we use VIC to implement a cheat radar, a wall-hack cheat and a trigger-bot. To support our claim that this type of cheat can be effectively used, we present the performance impact VICs have on gameplay by monitoring the frames per second (fps) while the cheats are activated. Our experimentation also shows how these cheats are currently undetected by the most popular anti-cheat systems, enabling a new paradigm that can take advantage of cloud infrastructure to
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefitin
We develop a statistical framework to evaluate evidence of alleged cheating involving illegal signaling in sports from a forensic perspective. We explain why, instead of a frequentist procedure, a Bayesian approach is called for. We apply this framework to cases of alleged cheating in professional bridge and professional baseball. The diversity of these applications illustrates the generality of the method.
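The abstract does not describe how the framework quantifies evidence, so the following is only a generic illustration of the Bayesian reasoning it refers to: Bayes' rule in odds form, where posterior odds equal prior odds times the likelihood ratio. The function name and the numbers in the usage example are hypothetical.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability of cheating with a likelihood ratio.

    likelihood_ratio = P(evidence | cheating) / P(evidence | no cheating).
    Uses Bayes' rule in odds form: posterior odds = prior odds * LR.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical numbers: a 1% prior and evidence 100 times more likely
    # under cheating than under honest play.
    print(round(posterior_probability(0.01, 100.0), 4))
```

The example shows why the prior matters in a forensic setting: evidence that is 100 times more likely under cheating still leaves the posterior near one half when the prior is 1%.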