Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.
Balancing combat encounters in Dungeons & Dragons (D&D) is a complex task that requires Dungeon Masters (DMs) to manually assess party strength, enemy composition, and dynamic player interactions while avoiding interruption of the narrative flow. In this paper, we propose Encounter Generation via Reinforcement Learning (NTRL), a novel approach that automates Dynamic Difficulty Adjustment (DDA) in D&D via combat encounter design. By framing the problem as a contextual bandit, NTRL generates encounters based on real-time party members' attributes. In comparison with classic DM heuristics, NTRL iteratively optimizes encounters to extend combat longevity (+200%), increase damage dealt to party members and thereby reduce post-combat hit points (-16.67%), and raise the number of player deaths while maintaining low total party kills (TPK). The intensification of combat forces players to act wisely and engage in tactical maneuvers, even though the generated encounters guarantee high win rates (70%). Even in comparison with encounters designed by human Dungeon Masters, NTRL demonstrates superior performance by enhancing the strategic depth of combat while increasing difficulty in a manne
This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, that even single sentence changes in prompts can greatly affect the output of models, and that instruct models are sufficient for this task compared to reasoning models.
Fractals emerge ubiquitously in nature and have gained increased attention in engineering applications, thanks to recent technological advancements enabling the fabrication of structures spanning many spatial scales. We show how the geometries of fractals can be exploited to determine their important mechanical properties, such as the first and second moments, which physically correspond to the center of mass and the moment of inertia, using a family of complex fractals known as the dragons.
DRAGONS (Data Reduction for Astronomy from Gemini Observatory North and South) is a platform for the reduction and processing of astronomical data. The Python-based, open-source package includes infrastructure for automation and algorithms for the processing of imaging and spectroscopic data, up to the analysis-ready stage. DRAGONS currently focuses on the reduction of Gemini data, although it allows for support of data from other instruments and telescopes through third-party extensions. Its latest release (v3.1) enables automated reduction of all currently-active Gemini imaging facility instruments, as well as optical longslit spectroscopic data, acquired with GMOS.
Dungeons & Dragons (D&D) is a tabletop roleplaying game with complex natural language interactions between players and hidden state information. Recent work has shown that large language models (LLMs) that have access to state information can generate higher quality game turns than LLMs that use dialog history alone. However, previous work used game state information that was heuristically created and was not a true gold standard game state. We present FIREBALL, a large dataset containing nearly 25,000 unique sessions from real D&D gameplay on Discord with true game state info. We recorded game play sessions of players who used the Avrae bot, which was developed to aid people in playing D&D online, capturing language, game commands and underlying game state information. We demonstrate that FIREBALL can improve natural language generation (NLG) by using Avrae state information, improving both automated metrics and human judgments of quality. Additionally, we show that LLMs can generate executable Avrae commands, particularly after finetuning.
Many NLP tasks, although well-resolved for general English, face challenges in specific domains like fantasy literature. This is evident in Named Entity Recognition (NER), which detects and categorizes entities in text. We analyzed 10 NER models on 7 Dungeons and Dragons (D&D) adventure books to assess domain-specific performance. Using open-source Large Language Models, we annotated named entities in these books and evaluated each model's precision. Our findings indicate that, without modifications, Flair, Trankit, and spaCy outperform others in identifying named entities in the D&D context.
This paper introduces the Forgotten Realms Wiki (FRW) data set and domain-specific natural language generation using FRW, along with related analyses. Forgotten Realms is the de facto default setting of the popular open-ended tabletop fantasy role-playing game Dungeons & Dragons. The data set was extracted from the Forgotten Realms Fandom wiki, which consists of more than 45,200 articles. The FRW data set comprises 11 sub-data sets in a number of formats: raw plain text, plain text annotated by article title, directed link graphs, wiki info-boxes annotated by the wiki article title, Poincaré embedding of the first link graph, and multiple Word2Vec and Doc2Vec models of the corpus. This is the first data set of this size for the Dungeons & Dragons domain. We then present a pairwise similarity comparison benchmark which utilizes similarity measures. In addition, we perform D&D domain-specific natural language generation using the corpus and evaluate the named entity classification with respect to the lore of Forgotten Realms.
This paper outlines two approaches for the mathematical simulation, modeling, and analysis of hypothetical creatures, in particular the dragons of HBO's television series Game of Thrones (GOT). Our first approach, the forward model, utilizes quasi-empirical observations of various features of GOT dragons, from which we mathematically derive the growth rate, other dimensions, energy consumption, etc. In the backward model, we use the energy consumption projected from a given ecological impact to model an expected dragon in terms of physical features. We compare and contrast both models to examine the plausibility of a real-world existence for our titular dragons and provide brief analyses of potential impacts on ecology.
AI advancements have augmented casual writing and story generation, but their usage poses challenges in collaborative storytelling. In role-playing games such as Dungeons & Dragons (D&D), composing prompts for generative AI requires a technical understanding to generate ideal results, which is difficult for novices. Thus, emergent narratives organically developed based on player actions and decisions have yet to be fully utilized. This paper envisions the use of generative AI in transforming storytelling into an interactive drama using dynamic and immersive narratives. First, we describe scenarios where narratives are created and character conversations are designed within an overarching fantasy disposition. Then, we recommend design guidelines to help create tools using generative AI in interactive storytelling. Lastly, we raise questions on its potential impact on player immersion and cognitive load. Our contributions may be expanded within the broader interactive storytelling domain, such as speech-conversational AI and persona-driven chatbots.
AI researchers have posited Dungeons and Dragons (D&D) as a challenge problem to test systems on various language-related capabilities. In this paper, we frame D&D specifically as a dialogue system challenge, where the tasks are to both generate the next conversational turn in the game and predict the state of the game given the dialogue history. We create a gameplay dataset consisting of nearly 900 games, with a total of 7,000 players, 800,000 dialogue turns, 500,000 dice rolls, and 58 million words. We automatically annotate the data with partial state information about the game play. We train a large language model (LM) to generate the next game turn, conditioning it on different information. The LM can respond as a particular character or as the player who runs the game--i.e., the Dungeon Master (DM). It is trained to produce dialogue that is either in-character (roleplaying in the fictional world) or out-of-character (discussing rules or strategy). We perform a human evaluation to determine what factors make the generated output plausible and interesting. We further perform an automatic evaluation to determine how well the model can predict the game state given the his
We propose a novel task, G4C, to study teacher-student natural language interactions in a goal-driven and grounded environment. Dungeons and Dragons (D&D), a role-playing game, provides an ideal setting to investigate such interactions. Here, the Dungeon Master (DM), i.e., the teacher, guides the actions of several players -- students, each with their own personas and abilities -- to achieve shared goals grounded in a fantasy world. Our approach is to decompose and model these interactions into (1) the DM's intent to guide players toward a given goal; (2) the DM's guidance utterance to the players expressing this intent; and (3) a theory-of-mind (ToM) model that anticipates the players' reaction to the guidance one turn into the future. We develop a novel reinforcement learning (RL) method for training a DM that generates guidance for players by rewarding utterances where the intent matches the ToM-anticipated player actions. Human and automated evaluations show that a DM trained to explicitly model intents and incorporate ToM of the players using RL generates better-quality guidance that is 3x more likely to fulfill the DM's intent than a vanilla natural language generation (N
For a graph $G$ and vertices $u,v$, we define the ASUA of $v$, $t(G,v,u)$, to be the average steps until absorption along a random walk terminating at $u$. We define a sea dragon to be a tree with a unique path $P$ such that if $d(u) \geq 3$ for some vertex $u$, then $u \in V(P)$. We use Markov chains to determine $t(G,v,u)$ for all vertices of several classes of sea dragons, a broad subclass of trees. Additionally, we give several results on equations related to ASUAs on general graphs.
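The quantity $t(G,v,u)$ above can be illustrated numerically. The following is a minimal sketch, assuming a simple random walk that moves to a uniformly chosen neighbour at each step, so that $t(G,u,u)=0$ and $t(G,v,u) = 1 + \frac{1}{d(v)}\sum_{w \sim v} t(G,w,u)$ for $v \neq u$. The function name `absorption_times` and the two example graphs are hypothetical and not taken from the paper; the linear system is solved exactly with rational arithmetic.

```python
from fractions import Fraction

def absorption_times(adj, target):
    """Expected number of steps for a simple random walk to first reach `target`.

    `adj` maps each vertex to a list of its neighbours (connected, undirected graph).
    """
    nodes = [v for v in adj if v != target]
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Augmented matrix for t(v) - (1/deg v) * sum of t(w) over neighbours w != target = 1.
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for v in nodes:
        i = idx[v]
        rows[i][i] += 1
        rows[i][n] = Fraction(1)
        for w in adj[v]:
            if w != target:
                rows[i][idx[w]] -= Fraction(1, len(adj[v]))
    # Gauss-Jordan elimination over the rationals.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        p = rows[col][col]
        rows[col] = [x / p for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    times = {v: rows[idx[v]][n] for v in nodes}
    times[target] = Fraction(0)
    return times

# Path 0-1-2-3, absorbing at vertex 0.
path_times = absorption_times({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, 0)

# Small tree: path 0-1-2 with an extra leaf 3 attached to vertex 1, absorbing at 0.
tree_times = absorption_times({0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}, 0)
```

On the path, the computed times agree with the closed form $k(2n-k)$ for a vertex at distance $k$ from the absorbing end of a path with $n$ edges.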
A Littlewood polynomial is a polynomial whose coefficients lie in $\{- 1, +1\}$. While the majority of roots of a Littlewood polynomial of large degree are near the unit circle, numerical experiments suggest that when plotting the roots of \emph{all} Littlewood polynomials of a given large degree, striking fractal structures appear away from the unit circle. These fractals resemble the attractor of a certain iterated function system and are known as \emph{dragon curves}. In this note, we provide a rigorous explanation of this phenomenon, along with an analysis of a random variant, saying that such fractal behavior is typical.
The ability to automatically classify source code repositories with ''topics'' that reflect their content and purpose is very useful, especially when navigating or searching through large software collections. However, existing approaches often rely heavily on README files and other metadata, which are frequently missing, limiting their applicability in real-world large-scale settings. We present DRAGON, a repository classifier designed for very large and diverse software collections. It operates entirely on lightweight signals commonly stored in version control systems: file and directory names, and optionally the README when available. In repository classification at scale, DRAGON improves F1@5 from 54.8% to 60.8%, surpassing the state of the art. DRAGON remains effective even when README files are absent, with performance degrading by only 6% w.r.t. when they are present. This robustness makes it practical for real-world settings where documentation is sparse or inconsistent. Furthermore, many of the remaining classification errors are near misses, where predicted labels are semantically close to the correct topics. This property increases the practical value of the predictions
The Heighway dragon curve is one of the best-known fractal curves. There are two ways to construct the curve: repeatedly make a copy of the current curve, rotate it by 90 degrees, and connect the two; or repeatedly replace each straight segment in the curve by two segments with a right angle. A natural question is how to prove the equivalence of the two approaches. We generalise the construction of the curve to allow rotations to both sides. It then turns out that the two approaches are respectively a foldr and a foldl, and the key property for proving their equivalence, using the second duality theorem, is the distributivity of an "interleave" operator.
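The two constructions can be compared empirically. The sketch below is an illustration rather than the paper's formulation: it assumes the curve is encoded as its sequence of left/right turns, fixes a single rotation direction instead of the generalised both-sides version, and uses hypothetical function names. Copying appends a joining turn followed by the reversed, flipped turns of the copy; replacing interleaves fresh alternating turns with the existing ones.

```python
from itertools import cycle, islice

LEFT, RIGHT = "L", "R"

def flip(turn):
    """Swap a left turn for a right turn and vice versa."""
    return RIGHT if turn == LEFT else LEFT

def by_copying(n):
    """Copy the curve, rotate the copy by 90 degrees, and join the two, n times."""
    turns = []
    for _ in range(n):
        # Walking the rotated copy backwards reverses and flips its turns.
        turns = turns + [LEFT] + [flip(t) for t in reversed(turns)]
    return turns

def by_replacing(n):
    """Replace every segment by two segments meeting at a right angle, n times."""
    turns = []
    for _ in range(n):
        # One fresh turn per existing segment, alternating sides along the curve.
        fresh = list(islice(cycle([LEFT, RIGHT]), len(turns) + 1))
        merged = []
        for new, old in zip(fresh, turns + [None]):
            merged.append(new)
            if old is not None:
                merged.append(old)  # existing corners keep their direction
        turns = merged
    return turns

third_iteration = "".join(by_copying(3))
agree = all(by_copying(n) == by_replacing(n) for n in range(12))
```

Agreement on the first few iterations is only a sanity check; it does not replace the equational proof described in the abstract.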
Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapI
We show that the geometric aspect ratio of the Twin Dragon equals $1/\varphi$, where $\varphi = (1+\sqrt{5})/2$ is the golden ratio. The result follows by solving the covariance fixed-point equation for the self-similar measure, which coincides with Lebesgue area since the similarity dimension is 2. The appearance of $\varphi$ is surprising: the Twin Dragon is defined purely via the Gaussian integer $1+i$, with no pentagonal or Fibonacci structure in its construction.
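The stated ratio can be checked numerically by iterating a covariance fixed-point equation. This is a sketch under assumptions not spelled out in the abstract: the Twin Dragon is taken as the set $T$ with $(1+i)T = T \cup (T+1)$ (digit set $\{0,1\}$), the measure is uniform on $T$, and the aspect ratio is read as the ratio of the principal standard deviations. Under these assumptions the covariance $\Sigma$ satisfies $\Sigma = B(\Sigma + D)B^{\top}$, where $B$ is the inverse of multiplication by $1+i$ and $D$ is the covariance of a uniformly chosen digit.

```python
import math

def matmul(X, Y):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def twin_dragon_covariance(iterations=200):
    """Iterate S -> B (S + D) B^T; the map contracts by a factor 1/2 per step."""
    B = [[0.5, 0.5], [-0.5, 0.5]]    # inverse of z -> (1+i)z as a real 2x2 matrix
    B_T = [[0.5, -0.5], [0.5, 0.5]]  # transpose of B
    D = [[0.25, 0.0], [0.0, 0.0]]    # covariance of a uniform digit in {0, 1} (assumed digit set)
    S = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iterations):
        M = [[S[i][j] + D[i][j] for j in range(2)] for i in range(2)]
        S = matmul(matmul(B, M), B_T)
    return S

def aspect_ratio(S):
    """Ratio of the smaller to the larger principal standard deviation."""
    trace = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    gap = math.sqrt(trace * trace - 4.0 * det)
    small, large = (trace - gap) / 2.0, (trace + gap) / 2.0
    return math.sqrt(small / large)

covariance = twin_dragon_covariance()
ratio = aspect_ratio(covariance)
golden = (1.0 + math.sqrt(5.0)) / 2.0
```

Solving the same equation by hand gives eigenvalues $\tfrac{1}{8}\left(1 \pm \tfrac{1}{\sqrt{5}}\right)$, whose ratio is $1/\varphi^{2}$, so the ratio of standard deviations is $1/\varphi$.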
Traditional monitoring of bearded dragon (Pogona vitticeps) behaviour is time-consuming and prone to errors. This project introduces an automated system for real-time video analysis, using You Only Look Once (YOLO) object detection models to identify two key behaviours: basking and hunting. We trained five YOLO variants (v5, v7, v8, v11, v12) on a custom, publicly available dataset of 1200 images, encompassing bearded dragons (600), heating lamps (500), and crickets (100). YOLOv8s was selected as the optimal model due to its superior balance of accuracy (mAP@0.5:0.95 = 0.855) and speed. The system processes video footage by extracting per-frame object coordinates, applying temporal interpolation for continuity, and using rule-based logic to classify specific behaviours. Basking detection proved reliable. However, hunting detection was less accurate, primarily due to weak cricket detection (mAP@0.5 = 0.392). Future improvements will focus on enhancing cricket detection through expanded datasets or specialised small-object detectors. This automated system offers a scalable solution for monitoring reptile behaviour in controlled environments, significantly improving research efficiency
The dragon-king earthquake hypothesis proposes that some very large to great earthquakes are not merely the extreme end of the frequency-magnitude Gutenberg-Richter distribution (FMD), but are generated by distinct physical mechanisms, making them statistical outliers. We develop a data-driven framework to systematically test the dragon-king earthquake hypothesis. Our method combines objective spatial clustering, based on data-adaptive kernel density estimation (KDE), with a high-power sequential outlier detection technique. For each identified cluster, we examine the tail of the FMD to identify anomalous events. Candidate dragon-kings are evaluated via robust statistical tests, primarily the max-robust-sum (MRS) test with inward sequential testing. For each observed statistic of the MRS test, we calculate its p-value, defined as the probability that this statistic could be generated by the null distribution. We apply this framework to seismicity surrounding the 1975 Haicheng mL 7.4 and 1976 Tangshan mL 7.9 earthquakes. For Haicheng, the mainshock shows a strong dragon-king signature in its pre-mainshock sequence, with p-values between 0.03 and 0.07 across a stable range of KDE de