What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping mode
Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy
Distributed applications need identifiers that satisfy storage efficiency, chronological sortability, origin metadata embedding, zero-lookup verifiability, confidentiality for external consumers, and multi-century addressability. Based on our literature survey, no existing scheme provides all six of these identifier properties within a unified system. This paper introduces Source Known Identifiers (SKIDs), a three-tier identity system that projects a single entity identity across trust boundaries, addressing all six properties. The first tier, Source Known ID (SKID), is a 64-bit signed integer embedding a timestamp with a 250-millisecond precision, application topology, and a per-entity-type sequence counter. It serves as the database primary key, providing compact storage (8 bytes) and natural B-tree ordering for optimized database indexing. The second tier, Source Known Entity ID (SKEID), extends the SKID into a 128-bit Universally Unique Identifier (UUID) compatible value by adding an entity type discriminator, an epoch selector, and a BLAKE3 keyed message authentication code (MAC). SKEIDs enable zero-lookup verification of identifier origin, integrity, and entity type within tr
Personal information retrieval fails when systems ignore how human memory works. While existing platforms force keyword searches across isolated silos, humans naturally recall through episodic cues like when, where, and in what context information was encountered. This dissertation presents the Unified Personal Index (UPI), a memory-aligned architecture that bridges this fundamental gap. The Indaleko prototype demonstrates the UPI's feasibility on a 31-million file dataset spanning 160TB across eight storage platforms. By integrating temporal, spatial, and activity metadata into a unified graph database, Indaleko enables natural language queries like "photos near the conference venue last spring" that existing systems cannot process. The implementation achieves sub-second query responses through memory anchor indexing, eliminates cross-platform search fragmentation, and maintains perfect precision for well-specified memory patterns. Evaluation against commercial systems (Google Drive, OneDrive, Dropbox, Windows Search) reveals that all fail on memory-based queries, returning overwhelming result sets without contextual filtering. In contrast, Indaleko successfully processes multi-di
The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing FAIR-compliant standardized access to cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users. The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis. By providing UUID-anonymized user histories and temporal alignment of banned communities' activity patterns, MADOC supports research on content moderation impacts and platform migration trends. Distributed via Zenodo with persistent identifiers and Python/R toolkits, the dataset adheres to FAIR principles while addressing post-API-era research challenges through ethical aggregation of public social media archives.
This technical note presents a reproducible workflow for converting a legacy archaeological image collection into a structured and segmentation ready dataset. The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS), a dataset that provides thousands of standardised photographs but no mechanism for bulk download or automated processing. To address this, two open source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting ADS Terms of Use and ethical scraping guidelines; and an image processing pipeline that renames files using UUIDs, generates binary masks and bounding boxes through classical computer vision, and stores all derived information in a COCO compatible Json file enriched with archaeological metadata. The original images are not redistributed, and only derived products such as masks, outlines, and annotations are shared. Together, these components provide a lightweight and reusable approach for transforming web based archaeological image collections into machine learning friendly formats, facilitati
Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resultin
Effective recommendation systems rely on capturing user preferences, often requiring incorporating numerous features such as universally unique identifiers (UUIDs) of entities. However, the exceptionally high cardinality of UUIDs poses a significant challenge in terms of model degradation and increased model size due to sparsity. This paper presents two innovative techniques to address the challenge of high cardinality in recommendation systems. Specifically, we propose a bag-of-words approach, combined with layer sharing, to substantially decrease the model size while improving performance. Our techniques were evaluated through offline and online experiments on Uber use cases, resulting in promising results demonstrating our approach's effectiveness in optimizing recommendation systems and enhancing their overall performance.
Establishing reliable data exchange in an underwater domain using energy and power-efficient communication methods is crucial and challenging. Radio frequencies are absorbed by the salty and mineral-rich water and optical signals are obstructed and scattered after short distances. In contrast, acoustic communication benefits from low absorption and enables communication over long distances. Underwater communication must match low power and energy requirements as underwater sensor systems must have a long battery lifetime and need to work reliably due to their deployment and maintenance cost. For long-term deployments, the sensors' overall power consumption is determined by the power consumption during idle state. It can be reduced by integrating asynchronous always-on wake-up circuits with nano-watt power consumption. However, this approach does reduce but not eliminate idle power consumption, leaving a margin for improvement. This paper presents a passive and asynchronous wake-up receiver for acoustic underwater communication enabling zero-power always-on listening. Zero-power listening is achieved by combining energy and information transmission using a low-power wake-up receiver
Humans can perform complex tasks with long-term objectives by planning, reasoning, and forecasting outcomes of actions. For embodied agents to achieve similar capabilities, they must gain knowledge of the environment transferable to novel scenarios with a limited budget of additional trial and error. Learning-based approaches, such as deep RL, can discover and take advantage of inherent regularities and characteristics of the application domain from data, and continuously improve their performances, however at a cost of large amounts of training data. This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks, focusing on enhancing learning efficiency, interpretability, and transferability across novel scenarios. Four key contributions are made. 1) CALVIN, a differential planner that learns interpretable models of the world for long-term planning. It successfully navigated partially observable 3D environments, such as mazes and indoor rooms, by learning the rewards and state transitions from expert demonstrations. 2) SOAP, an RL algorithm that discovers options unsupervised for long-horizon tasks. Options segment a task into subtasks and
The training of diffusion-based models for image generation is predominantly controlled by a select few Big Tech companies, raising concerns about privacy, copyright, and data authority due to their lack of transparency regarding training data. To ad-dress this issue, we propose a federated diffusion model scheme that enables the independent and collaborative training of diffusion models without exposing local data. Our approach adapts the Federated Averaging (FedAvg) algorithm to train a Denoising Diffusion Model (DDPM). Through a novel utilization of the underlying UNet backbone, we achieve a significant reduction of up to 74% in the number of parameters exchanged during training,compared to the naive FedAvg approach, whilst simultaneously maintaining image quality comparable to the centralized setting, as evaluated by the FID score.
The proliferation of AI-generated images has intensified the need for robust content authentication methods. We present InvisMark, a novel watermarking technique designed for high-resolution AI-generated images. Our approach leverages advanced neural network architectures and training strategies to embed imperceptible yet highly robust watermarks. InvisMark achieves state-of-the-art performance in imperceptibility (PSNR$\sim$51, SSIM $\sim$ 0.998) while maintaining over 97\% bit accuracy across various image manipulations. Notably, we demonstrate the successful encoding of 256-bit watermarks, significantly expanding payload capacity while preserving image quality. This enables the embedding of UUIDs with error correction codes, achieving near-perfect decoding success rates even under challenging image distortions. We also address potential vulnerabilities against advanced attacks and propose mitigation strategies. By combining high imperceptibility, extended payload capacity, and resilience to manipulations, InvisMark provides a robust foundation for ensuring media provenance in an era of increasingly sophisticated AI-generated content. Source code of this paper is available at: ht
Streamlined Byzantine Fault Tolerant (BFT) protocols, such as HotStuff [PODC'19], and weighted voting represent two possible strategies to improve consensus in the distributed systems world. Several studies have been conducted on both techniques, but the research on combining the two is scarce. To cover this knowledge gap, we introduce a weighted voting approach on Hotstuff, along with two optimisations targeting weight assignment distribution and leader rotation in the underlying state replication protocol. Moreover, the weighted protocols developed rely on studies proving the effectiveness of a specific voting power assignment based on discrete values. We generalise this approach by presenting a novel continuous weighting scheme applied to the Hotstuff protocol to highlight the effectiveness of this technique in faulty scenarios. We prove the significant latency reduction impact of weighted voting on streamlined protocols and advocate for further research.
In this paper we describe a novel method of exchanging data between Bluetooth smartphones on the Android platform without requiring pairing between devices. We discuss our approach of encoding and decoding data inside the UUIDs used by the Bluetooth Service Discovery Protocol (SDP). Future research remains to be done on the latency, bandwidth and compatibility of this approach, as well as the possibility of utilizing other protocols in conjunction with this method.
It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
Code initialization -- the step of loading code, executing static code, filling caches, and forming re-used connections -- tends to dominate cold-start time in serverless compute systems such as AWS Lambda. Post-initialization memory snapshots, cloned and restored on start, have emerged as a viable solution to this problem, with incremental snapshot and fast restore support in VMMs like Firecracker. Saving memory introduces the challenge of managing high-value memory contents, such as cryptographic secrets. Cloning introduces the challenge of restoring the uniqueness of the VMs, to allow them to do unique things like generate UUIDs, secrets, and nonces. This paper examines solutions to these problems in the every microsecond counts context of serverless cold-start, and discusses the state-of-the-art of available solutions. We present two new interfaces aimed at solving this problem -- MADV\_WIPEONSUSPEND and SysGenId -- and compare them to alternative solutions.
Astronomers have finally cracked the mystery behind a strange class of repeating cosmic signals that has baffled scientists for years。 Using Australia’s ASKAP radio telescope, researchers traced the bursts to a rare stellar duo in which a dense white dwarf is relentlessly siphoning material from a nearby red dwarf companion。 As the stolen matter sp
For more than a century, pianists and music teachers have argued over whether a performer’s touch can actually change the tone color of a piano note — and now scientists say the answer is yes。 Using a cutting-edge sensor system that tracked piano key movements at 1,000 frames per second, researchers discovered that elite pianists subtly manipulate