Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers -- obtained via fine-tuning large LLMs or pre-training smaller models from scratch -- can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust condit
Field-Programmable Gate Array (FPGA) development tool chains are widely used in FPGA design, simulation, and verification in critical areas like communications, automotive electronics, and aerospace. Commercial FPGA tool chains such as Xilinx' Vivado aids developers in swiftly identifying and rectifying bugs and issues in FPGA designs through a robust built-in debugger, ensuring the correctness and development efficiency of the FPGA design. Hardening such FPGA chip debugger tools by testing is crucial since engineers might misinterpret code and introduce incorrect fixes, leading to security risks. However, FPGA chip debugger tools are challenging to test as they require assessing both RTL designs and a series of debugging actions, including setting breakpoints and stepping through the code. To address this issue, we propose a interactive differential testing approach called DB-Hunter to detect bugs in Vivado's FPGA chip debugger tools. Specifically, DB-Hunter consists of three components: RTL design transformation component, debug action transformation component, and interactive differential testing component. By performing RTL design and debug action transformations, DB-Hunter gen
The Visual Debugger is an IntelliJ IDEA plugin that presents debug information as an object diagram to enhance program understanding. Reflecting on our past development, we detail the lessons learned and roadblocks we have experienced while implementing and integrating the Visual Debugger into the IntelliJ IDEA. Furthermore, we describe recent improvements to the Visual Debugger, greatly enhancing the plugin in the present. Looking into the future, we propose solutions to overcome the roadblocks encountered while developing the plugin and further plans for the Visual Debugger.
Debugging is a fundamental skill that novice programmers must develop. Numerous tools have been created to assist novice programmers in this process. Recently, large language models (LLMs) have been integrated with automated program repair techniques to generate fixes for students' buggy code. However, many of these tools foster an over-reliance on AI and do not actively engage students in the debugging process. In this work, we aim to design an intuitive debugging assistant, CodeHinter, that combines traditional debugging tools with LLM-based techniques to help novice debuggers fix semantic errors while promoting active engagement in the debugging process. We present findings from our second design iteration, which we tested with a group of undergraduate students. Our results indicate that the students found the tool highly effective in resolving semantic errors and significantly easier to use than the first version. Consistent with our previous study, error localization was the most valuable feature. Finally, we conclude that any AI-assisted debugging approach should be personalized based on user profiles to optimize their interactions with the tool.
Automated debugging, long pursued in a variety of fields from software engineering to cybersecurity, requires a framework that offers the building blocks for a programmable debugging workflow. However, existing debuggers are primarily tailored for human interaction, and those designed for programmatic debugging focus on kernel space, resulting in limited functionality in userland. To fill this gap, we introduce libdebug, a Python library for programmatic debugging of userland binary executables. libdebug offers a user-friendly API that enables developers to build custom debugging tools for various applications, including software engineering, reverse engineering, and software security. It is released as an open-source project, along with comprehensive documentation to encourage use and collaboration across the community. We demonstrate the versatility and performance of libdebug through case studies and benchmarks, all of which are publicly available. We find that the median latency of syscall and breakpoint handling in libdebug is 3 to 4 times lower compared to that of GDB.
Large Language Models (LLMs) frequently generate buggy code with complex logic errors that are challenging to diagnose. While existing LLM-based self-repair approaches conduct intensive static semantic analysis or reply on superficial execution logs, they miss the in-depth runtime behaviors that often expose bug root causes-lacking the interactive dynamic analysis capabilities that make human debugging effective. We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control. Our dual-agent framework enables strategic breakpoint placement, targeted state inspection, and incremental runtime experimentation within stateful debugger sessions. Unlike existing methods that follow fixed log collection procedures, InspectCoder adaptively inspects and perturbs relevant intermediate states at runtime, and leverages immediate process rewards from debugger feedback to guide multi-step reasoning, transforming LLM debugging paradigm from blind trial-and-error into systematic root cause diagnosis. We conduct comprehensive experiments on two challenging self-repair benchmarks: BigCodeBench-R and LiveCodeBen
The opaque nature and unexplained behavior of transformer-based language models (LMs) have spurred a wide interest in interpreting their predictions. However, current interpretation methods mostly focus on probing models from outside, executing behavioral tests, and analyzing salience input features, while the internal prediction construction process is largely not understood. In this work, we introduce LM-Debugger, an interactive debugger tool for transformer-based LMs, which provides a fine-grained interpretation of the model's internal prediction process, as well as a powerful framework for intervening in LM behavior. For its backbone, LM-Debugger relies on a recent method that interprets the inner token representations and their updates by the feed-forward layers in the vocabulary space. We demonstrate the utility of LM-Debugger for single-prediction debugging, by inspecting the internal disambiguation process done by GPT2. Moreover, we show how easily LM-Debugger allows to shift model behavior in a direction of the user's choice, by identifying a few vectors in the network and inducing effective interventions to the prediction process. We release LM-Debugger as an open-source
Compiler diagnostics for type inference failures are notoriously bad, and type classes only make the problem worse. By introducing a complex search process during inference, type classes can lead to wholly inscrutable or useless errors. We describe a system, Argus, for interactively visualizing type class inferences to help programmers debug inference failures, applied specifically to Rust's trait system. The core insight of Argus is to avoid the traditional model of compiler diagnostics as one-size-fits-all, instead providing the programmer with different views on the search tree corresponding to different debugging goals. Argus carefully uses defaults to improve debugging productivity, including interface design (e.g., not showing full paths of types by default) and heuristics (e.g., sorting obligations based on the expected complexity of fixing them). We evaluated Argus in a user study where $N = 25$ participants debugged type inference failures in realistic Rust programs, finding that participants using Argus correctly localized $2.2\times$ as many faults and localized $3.3\times$ faster compared to not using Argus.
Large Language Models (LLMs) have shown incredible potential in code generation tasks, and recent research in prompt engineering have enhanced LLMs' understanding of textual information. However, ensuring the accuracy of generated code often requires extensive testing and validation by programmers. While LLMs can typically generate code based on task descriptions, their accuracy remains limited, especially for complex tasks that require a deeper understanding of both the problem statement and the code generation process. This limitation is primarily due to the LLMs' need to simultaneously comprehend text and generate syntactically and semantically correct code, without having the capability to automatically refine the code. In real-world software development, programmers rarely produce flawless code in a single attempt based on the task description alone, they rely on iterative feedback and debugging to refine their programs. Inspired by this process, we introduce a novel architecture of LLM-based agents for code generation and automatic debugging: Refinement and Guidance Debugging (RGD). The RGD framework is a multi-LLM-based agent debugger that leverages three distinct LLM agents
The recent performance improvements in mixed-integer programming (MIP) have been accompanied by a significantly increased complexity of the codes of MIP solvers, which poses challenges in fixing implementation errors. In this paper, we introduce MIP-DD, a solver-independent tool, which to the best of our knowledge is the first open-source delta debugger for MIP. Delta debugging is a hypothesis-trial-result approach to isolate the cause of a solver failure. MIP-DD simplifies MIP instances while maintaining the undesired behavior. Preliminary versions already supported and motivated fixes for many bugs in the SCIP releases 8.0.1 to 8.1.1. In these versions, MIP-DD successfully contributed to 24 out of all 51 documented MIP-related bugfixes even for some long-known issues. In selected case studies we highlight that instances triggering fundamental bugs in SCIP can typically be reduced to a few variables and constraints in less than an hour. This makes it significantly easier to manually trace and check the solution process on the resulting simplified instances. A promising future application of MIP-DD is the analysis of performance bottlenecks, which could very well benefit from simpl
Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their co
Debugging distributed systems is hard. Most of the techniques that have been developed for debugging such systems use either extensive model checking, or postmortem analysis of logs and traces. Interactive debugging is typically a tool that is only effective in single threaded and single process applications, and is rarely applied to distributed systems. While the live observation of state changes using interactive debuggers is effective, it comes with a host of problems in distributed scenarios. In this paper, we discuss the requirements an interactive debugger for distributed systems should meet, the role the underlying distributed model plays in facilitating the debugger, and the implementation of our interactive debugger: GoTcha. GoTcha is a browser based interactive debugger for distributed systems built on the Global Object Tracker (GoT) programming model. We show how the GoT model facilitates the debugger, and the features that the debugger can offer. We also demonstrate a typical debugging workflow.
The Chromium open-source project has become a fundamental piece of the Web as we know it today, with multiple vendors offering browsers based on its codebase. One of its most popular features is the possibility of altering or enhancing the browser functionality through third-party programs known as browser extensions. Extensions have access to a wide range of capabilities through the use of APIs exposed by Chromium. The Debugger API -- arguably the most powerful of such APIs -- allows extensions to use the Chrome DevTools Protocol (CDP), a capability-rich tool for debugging and instrumenting the browser. In this paper, we describe several vulnerabilities present in the Debugger API and in the granting of capabilities to extensions that can be used by an attacker to take control of the browser, escalate privileges, and break context isolation. We demonstrate their impact by introducing six attacks that allow an attacker to steal user information, monitor network traffic, modify site permissions (\eg access to camera or microphone), bypass security interstitials without user intervention, and change the browser settings. Our attacks work in all major Chromium-based browsers as they a
In this tool demonstration, we give an overview of the Chameleon type debugger. The type debugger's primary use is to identify locations within a source program which are involved in a type error. By further examining these (potentially) problematic program locations, users gain a better understanding of their program and are able to work towards the actual mistake which was the cause of the type error. The debugger is interactive, allowing the user to provide additional information to narrow down the search space. One of the novel aspects of the debugger is the ability to explain erroneous-looking types. In the event that an unexpected type is inferred, the debugger can highlight program locations which contributed to that result. Furthermore, due to the flexible constraint-based foundation that the debugger is built upon, it can naturally handle advanced type system features such as Haskell's type classes and functional dependencies.
Debugging is an essential part of software maintenance and evolution since it allows software developers to analyze program execution step by step. Understanding a program is required to fix potential flaws, alleviate bottlenecks, and implement new desired features. Thus, software developers spend a large percentage of their time validating and debugging software, resulting in high software maintenance and evolution cost. We aim to reduce this cost by providing a novel visual debugging tool to software developers to foster program comprehension during debugging. Our debugging tool visualizes program execution information graphically as an object diagram and is fully integrated into the popular Java development environment IntelliJ IDEA. Moreover, the object diagram allows interactions to explore program execution information in more detail. A demonstration of our tool is available at https://www.youtube.com/watch?v=lU_OgotweRk.
Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and debugger pipelines - and is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt-level effects on code complexity, but the architecture-level question is open. We compare six widely-used multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT-4o family across all 164 HumanEval tasks - 1,968 paired observations - using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall's $W$ and matched-pairs rank-biserial effect sizes) in both all-completions and passing-only conditions. The six architectures collapse into two indistinguishable complexity clusters separated by a 50-130% gap, the same partition in both model
Debugging nondeterministic programs is inherently difficult, particularly in microcontroller environments where execution paths can diverge unpredictably due to external sensor inputs. Traditional debugging techniques often fail to capture or reproduce this nondeterministic behavior effectively. Multiverse debugging has emerged as a compelling technique to debug nondeterministic programs, allowing developers to systematically explore all possible execution paths. Unfortunately, current multiverse debuggers are snapshot-based and most operate over a model of the program, limiting their use for debugging resource-constrained microcontrollers. Additionally, current multiverse debuggers, even ones specifically designed for microcontrollers suffer from state explosion making the state space overwhelming during debugging. To address these challenges, we introduce a trace-based multiverse debugger with a novel state-space reduction technique based on concolic execution. Our approach interleaves concolic analysis with live debugging to identify input values that define unique program paths. This hybrid technique efficiently prunes redundant paths from the state space while ensuring full co
Debugging is hard. Interactive debuggers are mostly the same. They show you a stack, a way to sample the state of the stack, and, if the debugger is live, a way to step through execution. The standard interactive debugger for a general-purpose programming language provided by a mainstream IDE mostly offers a low-level interface in terms of generic language constructs to track down and fix bugs. A custom debugger, such as those developed for specific application domains, offers alternative interfaces more suitable to the specific execution context of the program being debugged. Custom debuggers offering contextual debugging views and actions can greatly improve our ability to reason about the current problem. Implementing such custom debuggers, however, is non-trivial, and poses a barrier to improving the debugging experience. In this paper we introduce "moldable exceptions", a lightweight mechanism to adapt a debugger's interface based on contextual information provided by a raised exception. We present, through a series of examples, how moldable exceptions can enhance a live programming environment.
Motivated by experience in programming and in the teaching of programming, we make another assault on the longstanding problem of debugging. Having explored why debuggers are not used as widely as one might expect, especially in functional programming environments, we define the characteristics of a debugger which make it usable and thus likely to be widely used. We present work on a new debugger for the functional programming language OCaml which operates by direct interpretation of the program source, allowing the printing out of individual steps of the program's evaluation, and discuss its technical implementation and practical use. It has two parts: a stand-alone debugger which can run OCaml programs by interpretation and so allow their behaviour to be inspected; and an OCaml syntax extension, which allows the part of a program under scrutiny to be interpreted in the same fashion as the stand-alone debugger whilst the rest of the program runs natively. We show how this latter mechanism can create a source-level debugging system that has the characteristics of a usable debugger and so may eventually be expected to be suitable for widespread adoption.
Debugging non-deterministic programs on microcontrollers is notoriously challenging, especially when bugs manifest in unpredictable, input-dependent execution paths. A recent approach, called multiverse debugging, makes it easier to debug non-deterministic programs by allowing programmers to explore all potential execution paths. Current multiverse debuggers enable both forward and backward traversal of program paths, and some facilitate jumping to any previously visited states, potentially branching into alternative execution paths within the state space. Unfortunately, debugging programs that involve input/output operations using existing multiverse debuggers can reveal inaccessible program states, i.e. states which are not encountered during regular execution. This can significantly hinder the debugging process, as the programmer may spend substantial time exploring and examining inaccessible program states, or worse, may mistakenly assume a bug is present in the code, when in fact, the issue is caused by the debugger. This paper presents a novel approach to multiverse debugging, which can accommodate a broad spectrum of input/output operations. We provide the semantics of our a