Found 20 results
We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems and the implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai-scientist-artefacts-v1
Europium-based materials are highly attractive due to their diverse range of physical properties. In this study, we aimed to synthesize single crystals of the potentially topological semimetallic compound EuAgP, which to date has only been obtained in polycrystalline form. The flux method was employed for the syntheses, using fluxes such as Bi, Sn, Pb, and In in various ratios. The purpose of using Bi flux was to try synthesizing an analog of EuAgAs single crystals by fully substituting arsenic with phosphorus. The obtained crystals were characterized by x-ray diffraction and scanning electron microscopy. Despite many unsuccessful attempts to synthesize EuAgP single crystals, the study provides valuable insights into how different fluxes and their ratios influence the final synthesis product. It also underscores the complexity of designing analogs between arsenides and phosphides.
Basketball analytics has significantly advanced our understanding of the game, with shot selection emerging as a critical factor in both individual and team performance. With the advent of player tracking technologies, a wealth of granular data on shot attempts has become available, enabling a deeper analysis of shooting behavior. However, modeling shot selection presents unique challenges due to the spatial and contextual complexities influencing shooting decisions. This paper introduces a novel approach to the analysis of basketball shot data, focusing on the spatial distribution of shot attempts, also known as intensity surfaces. We model these intensity surfaces using a Functional Bayesian Additive Regression Trees (FBART) framework, which allows for flexible, nonparametric regression, and uncertainty quantification while addressing the nonlinearity and nonstationarity inherent in shot selection patterns to provide a more accurate representation of the factors driving player performance; we further propose the Adaptive Functional Bayesian Additive Regression Trees (AFBART) model, which builds on FBART by introducing adaptive basis functions for improved computational efficiency
This paper investigates several distinct attempts to generalize the standard 2-dimensional phyllotaxy set construction to higher dimensions. We first recall known constructions for these sets on $2D$ manifolds of constant curvature (the Euclidean plane $\mathbb{R}^2$, the sphere $\mathbb{S}^2$ and the hyperbolic plane $\mathbb{H}^2$). We then propose a first attempt to get a $3D$ phyllotactic set by piling up suitably shifted Euclidean $2D$ phyllotactic sets. A different, radially triggered, solution is then analyzed. An interesting phyllotactic set on the hypersphere $\mathbb{S}^3$ is then generated using a Hopf fibration approach. Finally, a simple 4-dimensional example is presented, generated as a simple product of two 2-dimensional planar sets. A $3D$ phyllotaxy candidate is then derived by applying a "Cut and Project" algorithm.
In continuous-variable quantum key distribution (CV-QKD), the performance of the information reconciliation (IR) step is critical for the achievable secret key rate (SKR) and transmission distance. We show how to improve on the recently introduced implementation of an IR-protocol involving multiple decoding attempts (MDA) and validate the method on simulated data in different application scenarios. Throughout, we demonstrate meaningful SKR-gains compared to both the standard protocol of a single decoding attempt and to the original MDA-implementation, even at given decoding complexity.
Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found that models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and
Ecological Momentary Assessment provides real-time data on suicidal thoughts and behaviors, but predicting suicide attempts remains challenging due to their rarity and patient heterogeneity. We show that single models fit to all patients perform poorly, while individualized models improve performance but still overfit to patients with limited data. To address this, we introduce Latent Similarity Gaussian Processes (LSGPs) to capture patient heterogeneity, enabling those with little data to leverage similar patients' trends. Preliminary results show promise: even without kernel-design, we outperform all but one baseline while offering a new understanding of patient similarity.
User interactions with conversational agents (CAs) evolve in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture and reflect these in-the-wild interactions into chatbot designs, we propose RICoTA, a Korean red teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and users' testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.
Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological elements like urgency, fear, social proof, and other manipulative strategies, phishers can lure individuals into revealing sensitive and personalized information. Building on this pervasive issue within modern technology, this paper aims to analyze the effectiveness of 15 Large Language Models (LLMs) in detecting phishing attempts, specifically focusing on a randomized set of "419 Scam" emails. The objective is to determine which LLMs can accurately detect phishing emails by analyzing a text file containing email metadata based on predefined criteria. The experiment concluded that the following models, ChatGPT 3.5, GPT-3.5-Turbo-Instruct, and ChatGPT, were the most effective in detecting phishing emails.
A recent experimental study reports on measuring the temporal duration and the spatial extent of failed attempts to cross an activation barrier (i.e., "loops") for a folding transition in a single molecule and for a Brownian particle trapped within a bistable potential. Within the model of diffusive dynamics, however, both of these quantities are, on the average, exactly zero because of the recrossings of the barrier region boundary. That is, an observer endowed with infinite spatial and temporal resolution would find that finite loops do not exist (or, more precisely, form a set of measure zero). Here we develop a description of the experiment that takes finite experimental resolution into account and show how the experimental uncertainty of localizing the point, in time and space, where the barrier is crossed leads to observable distributions of loop times and sizes. Although these distributions generally depend on the experimental resolution, this dependence, in certain cases, may amount to a simple resolution-dependent factor and thus the experiments do probe inherent properties of barrier crossing dynamics.
In this paper, we develop a novel depth-based testing procedure on spatial point processes to statistically detect differences between made and missed field goal attempts for NBA players. We first obtain the depths of the two processes under the polar coordinate system. A two-dimensional Kolmogorov-Smirnov test is then performed to test the difference between the depths of the two processes. Through extensive simulation studies, we show that our testing procedure has good frequentist properties under both the null and the alternative hypotheses. A comparison against competing methods shows that our proposed procedure has better testing reliability and testing power. Application to the shot chart data of 191 NBA players in the 2017-2018 regular season offers interesting insights about these players' made and missed shot patterns.
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitation
Jailbreak attacks induce Large Language Models (LLMs) to generate harmful responses, posing severe misuse threats. Though research on jailbreak attacks and defenses is emerging, there is no consensus on evaluating jailbreaks, i.e., the methods to assess the harmfulness of an LLM's response are varied. Each approach has its own set of strengths and weaknesses, impacting its alignment with human values, as well as its time and financial cost. This diversity challenges researchers in choosing suitable evaluation methods and comparing different attacks and defenses. In this paper, we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly 90 jailbreak research papers published between May 2023 and April 2024. Our study introduces a systematic taxonomy of jailbreak evaluators, offering in-depth insights into their strengths and weaknesses, along with the current status of their adaptation. To aid further research, we propose JailbreakEval, a toolkit for evaluating jailbreak attempts. JailbreakEval includes various evaluators out-of-the-box, enabling users to obtain results with a single command or customized evaluation workflows. In summary, we regard Jailb
In the field of intelligent education, knowledge tracing (KT), which estimates and traces students' mastery of knowledge concepts to provide high-quality education, has attracted increasing attention. In KT, questions and knowledge concepts form natural graph structures, so some studies have applied graph neural networks (GNNs) to improve on KT models that do not use graph structure. However, most of them ignore both the questions' difficulties and students' attempts at questions. In fact, questions with the same knowledge concepts have different difficulties, and students' different attempts also represent different levels of knowledge mastery. In this paper, we propose a difficulty and attempts boosted graph-based KT (DAGKT), using rich information from students' records. Moreover, a novel method inspired by the F1 score is designed to establish the question similarity relationship. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed DAGKT.
Many applications of formal methods require automated reasoning about system properties, such as system safety and security. To improve the performance of automated reasoning engines, such as SAT/SMT solvers and first-order theorem provers, it is necessary to understand both the successful and failing attempts of these engines towards producing formal certificates, such as logical proofs and/or models. Such an analysis is challenging due to the large number of logical formulas generated during proof/model search. In this paper we focus on saturation-based first-order theorem proving and introduce the SATVIS tool for interactively visualizing saturation-based proof attempts in first-order theorem proving. We build SATVIS on top of the world-leading theorem prover VAMPIRE, by interactively visualizing the saturation attempts of VAMPIRE in SATVIS. Our work combines the automatic layout and visualization of the derivation graph induced by the saturation attempt with interactive transformations and search functionality. As a result, we are able to analyze and debug (failed) proof attempts of VAMPIRE. Thanks to its interactive visualisation, we believe SATVIS helps both experts and non-ex
It has been a notably elusive task to find a remotely sensical ansatz for a calculation of Sommerfeld's electrodynamic fine-structure constant alpha_QED ~ 1/137.036 based on first principles. However, this has not prevented a number of researchers from investing considerable effort into the problem, despite the formidable challenges, and a number of attempts have been recorded in the literature. Here, we review a possible approach based on the quantum electrodynamic (QED) beta function, and on algebraic identities relating alpha_QED to invariant properties of "internal" symmetry groups, as well as attempts to relate the strength of the electromagnetic interaction to the natural cut-off scale for other gauge theories. Conjectures based on both classical and quantum-field theoretical considerations are discussed. We point out apparent strengths and weaknesses of the most prominent attempts that were recorded in the literature. This includes possible connections to scaling properties of the Einstein-Maxwell Lagrangian which describes gravitational and electromagnetic interactions on curved space-times. Alternative approaches inspired by string theory are also discussed. A conceivabl
Large language models (LLMs) have achieved substantial progress in repository-level code generation. However, solving the same repository-level task often requires multiple attempts, while existing methods still optimize each attempt in isolation and do not preserve or reuse task-specific state across attempts. In this paper, we propose LiveCoder, a novel framework for repository-level code generation based on cross-attempt knowledge optimization. LiveCoder maintains persistent task-specific state from prior attempts to guide subsequent generation. This state includes success knowledge, which captures reusable signals from previously strong repositories, failure knowledge, which records unsuccessful outcomes and their diagnostic signals, and a historical-best repository, which preserves the strongest result found so far and prevents regression. These components collectively transform repeated repository generation into a persistent, knowledge-driven optimization process. We evaluate LiveCoder using four frontier LLMs on two representative repository-level code generation benchmarks. Extensive experimental results demonstrate the effectiveness and efficiency of LiveCoder, improving
Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effec
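The multi-attempt setting described in the abstract above can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: `run_multi_attempt`, the `answer_fn` callable standing in for the model, the exact-match check, and the wording of the feedback string are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional


def run_multi_attempt(
    answer_fn: Callable[[str, List[str]], str],
    question: str,
    correct_answer: str,
    max_attempts: int,
) -> Optional[int]:
    """Return the 1-based index of the first correct attempt, or None.

    answer_fn is a hypothetical stand-in for the model: it receives the
    question plus the feedback collected so far and returns an answer.
    """
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        answer = answer_fn(question, feedback)
        # Assumed correctness check: exact string match after stripping.
        if answer.strip() == correct_answer:
            return attempt
        # Assumed feedback format, given to the model before the next try.
        feedback.append(f"Attempt {attempt}: {answer!r} was incorrect.")
    return None


def toy_model(question: str, feedback: List[str]) -> str:
    """Toy stand-in for an LLM: wrong on the first try, right after feedback."""
    return "3" if not feedback else "4"


one_try = run_multi_attempt(toy_model, "What is 2 + 2?", "4", max_attempts=1)
two_tries = run_multi_attempt(toy_model, "What is 2 + 2?", "4", max_attempts=2)
print(one_try, two_tries)
```

With the toy model, a budget of one attempt fails while a budget of two succeeds on the second attempt, which mirrors the kind of gain from extra attempts that the abstract reports.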
State-of-the-art reasoning models utilize long chain-of-thought (CoT) to solve increasingly complex problems using more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, in which each attempt is allowed to build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes results in biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighting strategy to obtain unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences the training and the eventual Verification@K performance. Experiments, baselines, and ablations on synthetic and real data corroborate our theory and the benefits of CAL-GRPO over vanilla GRPO as well as naive weighting.
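As a small illustration of the Verification@K reward as the abstract defines it (the model succeeds by the K-th attempt), the following Python sketch computes it from per-attempt pass/fail outcomes. The function names and the toy data are assumptions for illustration. This is not the CAL-GRPO weighting scheme, which the abstract does not specify.

```python
from typing import List, Sequence


def verification_at_k(attempt_passed: Sequence[bool], k: int) -> float:
    """Return 1.0 if any of the first k attempts passed the verifier, else 0.0."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 if any(attempt_passed[:k]) else 0.0


def mean_verification_at_k(episodes: List[Sequence[bool]], k: int) -> float:
    """Average Verification@K over a non-empty list of episodes."""
    return sum(verification_at_k(e, k) for e in episodes) / len(episodes)


# Toy pass/fail outcomes per attempt for three problems (made-up data).
episodes = [
    [False, True, False],   # solved on the second attempt
    [False, False, False],  # never solved
    [True],                 # solved on the first attempt
]

for k in (1, 2, 3):
    print(k, mean_verification_at_k(episodes, k))
```

On this toy data the mean rises from one third at K = 1 to two thirds at K = 2 and stays there at K = 3, since the second problem is never solved.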
Despite a surge in robotics research dedicated to inferring and understanding human intent, a universally accepted definition remains elusive since existing works often equate human intention with specific task-related goals. This article seeks to address this gap by examining the multifaceted nature of intention. Drawing on insights from psychology, it attempts to consolidate a definition of intention into a comprehensible framework for a broader audience. The article classifies different types of intention based on psychological and communication studies, offering guidance to researchers shifting from pure technical enhancements to a more human-centric perspective in robotics. It then demonstrates how various robotics studies can be aligned with these intention categories. Finally, through in-depth analyses of collaborative search and object transport use cases, the article underscores the significance of considering the diverse facets of human intention.