This paper studies the impact of DRAM writes on DDR5-based systems. To efficiently perform DRAM writes, modern systems buffer write requests and try to complete multiple write operations whenever the DRAM mode is switched from read to write. When the DRAM system is performing writes, it is not available to service read requests, thus increasing read latency and reducing performance. We observe that, given the presence of on-die ECC in DDR5 devices, the time to perform a write operation varies significantly: from 1x (for writes to banks of different bankgroups) to 6x (for writes to banks within the same bankgroup) to 24x (for conflicting requests to the same bank). If we can orchestrate the write stream to favor write requests that incur lower latency, then we can reduce the stall time from DRAM writes and improve performance. However, in current systems, the write stream is dictated by the cache replacement policy, which makes eviction decisions without being aware of the variable latency of DRAM writes. The key insight of our work is to improve performance by modifying the cache replacement policy to increase bank-parallelism of DRAM writes. Our paper proposes {\em BARD (Bank-Awar
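The effect of write ordering on drain time can be illustrated with a toy cost model. This is a minimal sketch, not the paper's method: it charges each write the 1x/6x/24x relative cost quoted above based only on its relation to the immediately preceding write, and it assumes every same-bank pair conflicts. The names `write_cost` and `drain_cost` and the `(bankgroup, bank)` tuples are hypothetical.

```python
# Relative write costs quoted in the abstract (assumed to apply pairwise).
DIFF_BANKGROUP = 1   # previous write went to a different bankgroup
SAME_BANKGROUP = 6   # same bankgroup, different bank
SAME_BANK = 24       # same bank (assumed to be a conflicting request)


def write_cost(prev, cur):
    """Cost of write `cur` given the previous write `prev` (or None)."""
    if prev is None or prev[0] != cur[0]:
        return DIFF_BANKGROUP
    if prev[1] != cur[1]:
        return SAME_BANKGROUP
    return SAME_BANK


def drain_cost(writes):
    """Total relative cost of draining `writes`, a list of (bankgroup, bank)."""
    total = 0
    prev = None
    for cur in writes:
        total += write_cost(prev, cur)
        prev = cur
    return total


# Same four writes, two orderings.
clustered = [(0, 0), (0, 0), (1, 0), (1, 0)]
interleaved = [(0, 0), (1, 0), (0, 0), (1, 0)]
print(drain_cost(clustered), drain_cost(interleaved))
```

Under these assumptions the interleaved order is much cheaper than the clustered one, which is the kind of gap a bank-aware eviction policy would try to exploit.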
Phase Change Memory (PCM) has rapidly progressed and surpassed Dynamic Random-Access Memory (DRAM) in terms of scalability and standby energy efficiency. Altering a PCM cell's state during writes demands substantial energy, posing a significant challenge to PCM's role as the primary main memory. Prior research has explored methods to reduce write energy consumption, including the elimination of redundant writes, minimizing cell writes, and employing compact row buffers for filtering PCM main memory accesses. However, these techniques have drawbacks such as bit-wise comparison of the stored values, preemptive updates that increase write cycles, and poor endurance. In this paper, we propose WIRE, a new coding mechanism through which most write operations force a maximum of one-bit flip. In this coding-based data storage method, we look at the frequent value stack and assign a code word to the most frequent values such that they have a Hamming distance of one. In most write accesses, writing a value needs at most one bit flip, which can save considerable write energy. This technique can be augmented with a wear-leveling mechanism at the block level, and rotating the differenc
This paper demonstrates that adopting out-of-place writes is essential for database systems to fully leverage SSD performance and extend SSD lifespan. We propose a set of out-of-place optimizations that collectively reduce write amplification across both the DBMS and SSD layers. We redesign the in-place, B-tree-based LeanStore to write out-of-place and support these optimizations, and evaluate it on diverse OLTP benchmarks, dataset sizes, and SSDs. The final design improves throughput by 1.65-2.24x and reduces flash writes per operation by 6.2-9.8x on YCSB-A. On TPC-C with 15,000 warehouses, throughput improves by 2.45x while flash writes decrease by 7.2x. Finally, we show that the architecture can seamlessly support novel SSD interfaces such as ZNS and FDP.
LLM-powered coding agents increasingly rely on tool-use protocols such as the Model Context Protocol (MCP) to read and write files on a developer's workstation. When a write fails - due to content filters, truncation, or an interrupted session - the agent typically receives no structured signal, loses the draft, and wastes tokens retrying blindly. We present Resilient Write, an MCP server that interposes a six-layer durable write surface between the agent and the filesystem. The layers - pre-flight risk scoring, transactional atomic writes, resume-safe chunking, structured typed errors, out-of-band scratchpad storage, and task-continuity handoff envelopes - are orthogonal and independently adoptable. Each layer maps to a concrete failure mode observed during a real agent session in April 2026, in which content-safety filters silently rejected a draft containing redacted API-key prefixes. Three additional tools - chunk preview, format-aware validation, and journal analytics - emerged from using the system to compose this paper. A 186-test suite validates correctness at each layer, and quantitative comparison against naive and defensive baselines shows a 5x reduction in recovery time
As dynamic random access memory (DRAM) and other current transistor-based memories approach their scalability limits, the search for alternative storage methods becomes increasingly urgent. Phase-change memory (PCM) emerges as a promising candidate due to its scalability, fast access time, and zero leakage power compared to many existing memory technologies. However, PCM has significant drawbacks that currently hinder its viability as a replacement. PCM cells suffer from a limited lifespan because write operations degrade the physical material, and these operations consume a considerable amount of energy. For PCM to be a practical option for data storage, which involves frequent write operations, its cell endurance must be enhanced, and write energy must be reduced. In this paper, we propose SMART-WRITE, a method that integrates neural networks (NN) and reinforcement learning (RL) to dynamically optimize write energy and improve performance. The NN model monitors real-time operating conditions and device characteristics to determine optimal write parameters, while the RL model dynamically adjusts these parameters to further optimize PCM's energy consumption. By continuously adjusting
Log-Structured Merge (LSM) tree-based Key-Value Stores (KVSs) are widely adopted for their high performance in write-intensive environments, but they often face performance degradation due to write stalls during compaction. Prior solutions, such as regulating I/O traffic or using multiple compaction threads, can cause unexpected drops in throughput or increase host CPU usage, while hardware-based approaches using FPGA, GPU, and DPU aimed at reducing compaction duration introduce additional hardware costs. In this study, we propose KVACCEL, a novel hardware-software co-design framework that eliminates write stalls by leveraging a dual-interface SSD. KVACCEL allocates logical NAND flash space to support both block and key-value interfaces, using the key-value interface as a temporary write buffer during write stalls. This strategy significantly reduces write stalls, optimizes resource usage, and ensures consistency between the host and device by implementing an in-device LSM-based write buffer with an iterator-based range scan mechanism. Our extensive evaluation shows that for write-intensive workloads, KVACCEL outperforms ADOC by up to 1.17x in terms of throughput and performance-to
As transistor-based memory technologies like dynamic random access memory (DRAM) approach their scalability limits, the need to explore alternative storage solutions becomes increasingly urgent. Phase-change memory (PCM) has gained attention as a promising option due to its scalability, fast access speeds, and zero leakage power compared to conventional memory systems. However, despite these advantages, PCM faces several challenges that impede its broader adoption, particularly its limited lifespan due to material degradation during write operations, as well as the high energy demands of these processes. For PCM to become a viable storage alternative, enhancing its endurance and reducing the energy required for write operations are essential. This paper proposes the use of a neural network (NN) model to predict critical parameters such as write latency, energy consumption, and endurance by monitoring real-time operating conditions and device characteristics. These predictions are key to improving PCM performance and identifying optimal write settings, making PCM a more practical and efficient option for data storage in applications with frequent write operations. Our approach leads
High read and write performance is important for generic key-value stores, which are foundational to modern applications and databases. Yet, achieving high performance for mixed and dynamic workloads is challenging due to fundamental trade-offs between memory use and I/O for retrieval and updates. Past work emphasizes the trade-off between read- and write-optimization as expressed through the primary data structure, in combination with read-memory trade-off mechanisms like caching and filtering. This raises re-tuning costs as optimal trade-off targets change, due to restructuring of stored data. We show that write-memory trade-off mechanisms are under-developed in current designs, and propose a new approach to dynamic key-value store optimization using a novel read-/write-balanced on-disk structure, the TurtleTree, and flexible read-memory & write-memory tuning knobs. We describe how the design of TurtleKV, our prototype, avoids in-memory bottlenecks to achieve high performance across a wide range of tuning parameters. When evaluated using YCSB, TurtleKV matches state-of-the-art SplinterDB for inserts, and is 5x/12x faster than RocksDB/WiredTiger. In mixed workloads, TurtleKV is 16-
The increasing use of Non-Volatile Memory (NVM) in computer architecture has brought about new challenges, one of which is the write endurance problem. Frequent writes to a particular cache cell in NVM can lead to degradation of the memory cell and reduce its lifespan. To solve this problem, we propose a sample-based blocking technique for the Last Level Cache (LLC). Our approach involves defining a threshold value and sampling a subset of cache sets. If the number of writes to a way in a sampled set exceeds the threshold, the way is blocked, and writes are redirected to other ways. We also maintain a history structure to record the number of writes in a set and a PC-Table to use for blocking in unsampled sets. Based on the blocking in sampled sets, the variance of the values stored in the history structure is used to determine whether blocking had a positive impact, and on this basis the value corresponding to the instruction pointer is incremented or decremented. This value is later used for blocking in unsampled sets. Our results show that our approach significantly balances write traffic to the cache and improves the overall lifespan of the memory cells while having better performance to the base-line s
There is a long history of side channels in the memory hierarchy of modern CPUs. Especially the cache side channel is widely used in the context of transient execution attacks and covert channels. Therefore, many secure cache architectures have been proposed. Most of these architectures aim to make the construction of eviction sets infeasible by randomizing the address-to-cache mapping. In this paper, we investigate the peculiarities of write instructions in recent CPUs. We identify Write+Write, a new side channel on Intel CPUs that leaks whether two addresses contend for the same cache set. We show how Write+Write can be used for rapid construction of eviction sets on current cache architectures. Moreover, we replicate the Write+Write effect in gem5 and demonstrate on the example of ScatterCache how it can be exploited to efficiently attack state-of-the-art cache randomization schemes. In addition to the Write+Write side channel, we show how Write-After-Write effects can be leveraged to efficiently synchronize covert channel communication across CPU cores. This yields the potential for much more stealthy covert channel communication than before.
This work sheds light on whether and how creative writers' needs are met by existing research and commercial writing support tools (WST). We conducted a need-finding study to gain insight into writers' processes during creative writing through a qualitative analysis of responses to an online questionnaire and Reddit discussions on r/Writing. Using a systematic analysis of 115 tools and 67 research papers, we map out the landscape of how digital tools facilitate the writing process. Our triangulation of data reveals that research predominantly focuses on the writing activity and overlooks pre-writing activities and the importance of visualization. We distill 10 key takeaways to inform future research on WST and point to opportunities surrounding underexplored areas. Our work offers a holistic and up-to-date account of how tools have transformed the writing process, guiding the design of future tools that address writers' evolving and unmet needs.
This paper introduces GTX, a standalone main-memory write-optimized graph data system that specializes in structural and graph property updates while enabling concurrent reads and graph analytics through ACID transactions. Recent graph systems target concurrent read and write support while guaranteeing transaction semantics. However, their performance suffers under updates with real-world temporal locality over the same vertices and edges due to vertex-centric lock contention. GTX has an adaptive delta-chain locking protocol on top of a carefully designed latch-free graph storage. It eliminates vertex-level locking contention, and adapts to real-life workloads while maintaining sequential access to the graph's adjacency list storage. GTX's transactions further support cache-friendly block-level concurrency control, and cooperative group commit and garbage collection. This combination of features ensures high update throughput and provides low-latency graph analytics. Based on experimental evaluation, in addition to not sacrificing the performance of read-heavy analytical workloads, and having competitive performance similar to state-of-the-art systems, GTX has high read-write tran
For a write request, today's flash storage cannot distinguish the logical object it comes from. In such object-oblivious flash devices, concurrent writes from different objects are simply packed in their arrival order into flash memory blocks; hence objects with different lifetimes are multiplexed onto the same flash blocks. This multiplexing incurs write amplification, degrading performance. To tackle the multiplexing problem, we propose a novel interface for flash storage, FlashAlloc. It is used to pass the logical address ranges of logical objects to the flash storage and thus enable the storage to stream writes by object. The object-aware flash storage can de-multiplex writes from different objects with distinct deathtimes into per-object dedicated flash blocks. Given that popular data stores separate writes using objects (e.g., SSTables in RocksDB), we can achieve, unlike the existing solutions, transparent write streaming just by calling FlashAlloc upon object creation. Our experimental results using an open-source SSD prototype demonstrate that FlashAlloc can reduce write amplification factor (WAF) in RocksDB, F2FS, and MySQL by 1.5, 2.5, and 0.3, respectively, and thus im
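The de-multiplexing idea described above can be sketched with a toy device model. This is an illustration under stated assumptions, not the FlashAlloc implementation: the class `ObjectAwareFlash`, its `flash_alloc` and `write` methods, and the use of stream 0 for unregistered addresses are all hypothetical, and registered ranges are assumed not to overlap.

```python
from bisect import bisect_right, insort
from collections import defaultdict


class ObjectAwareFlash:
    """Toy model: route each written LBA to the stream of its registered range."""

    def __init__(self):
        self._starts = []                 # sorted start LBAs of registered ranges
        self._ranges = {}                 # start LBA -> (end LBA exclusive, stream id)
        self.streams = defaultdict(list)  # stream id -> LBAs in arrival order

    def flash_alloc(self, start, length):
        """Register a logical address range for one object (hypothetical call)."""
        stream_id = len(self._ranges) + 1  # stream 0 is reserved for other writes
        insort(self._starts, start)
        self._ranges[start] = (start + length, stream_id)
        return stream_id

    def write(self, lba):
        """Append the write to the stream owning `lba`, or to stream 0."""
        stream_id = 0
        i = bisect_right(self._starts, lba) - 1
        if i >= 0:
            end, sid = self._ranges[self._starts[i]]
            if lba < end:
                stream_id = sid
        self.streams[stream_id].append(lba)
        return stream_id


dev = ObjectAwareFlash()
a = dev.flash_alloc(0, 100)     # object A owns LBAs 0..99
b = dev.flash_alloc(100, 100)   # object B owns LBAs 100..199
for lba in [0, 100, 1, 101, 500]:  # interleaved arrival order
    dev.write(lba)
print(dict(dev.streams))
```

Although the writes arrive interleaved, each object's pages end up grouped in its own stream, so pages with the same lifetime would share flash blocks in a real device.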
Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects -- providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.
Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel writes due to the lack of a deep understanding of compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve the parallel-write performance. Specifically, we propose analytical models to predict the time of compression and parallel write before the actual compression to enable compression-write overlapping. We also introduce an extra space in the process to handle possible data overflows resulting from prediction uncertainty in compression ratios. Moreover, we propose an optimization to reorder the compression tasks to increase the overlapping efficiency. Experiments with up to 4,096 cores from Summit show that our solution improves the write performance by up to 4.5X and 2.9X over the non-compression and lossy compression solutions, respectively, with only 1.5% storage overhead (compared to original data) on two real-world HPC applications.
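Why task order matters for compression-write overlap can be seen in a two-stage pipeline model. The sketch below is not the paper's optimization: it substitutes Johnson's rule, the classic ordering for a two-machine flow shop, and assumes each task has a predicted compression time and a predicted write time, with compression and writing each processing one task at a time. The function names and the sample times are hypothetical.

```python
def pipeline_makespan(tasks):
    """Finish time when task i is written only after it is compressed.

    `tasks` is a list of (compress_time, write_time) pairs processed in order.
    """
    compress_done = 0.0
    write_done = 0.0
    for compress_time, write_time in tasks:
        compress_done += compress_time
        # The write starts once the data is compressed and the writer is free.
        write_done = max(write_done, compress_done) + write_time
    return write_done


def johnson_order(tasks):
    """Reorder tasks by Johnson's rule for a two-stage pipeline."""
    # Tasks that compress faster than they write go first, shortest compression first.
    first = sorted((t for t in tasks if t[0] <= t[1]), key=lambda t: t[0])
    # The rest go last, longest write first.
    last = sorted((t for t in tasks if t[0] > t[1]), key=lambda t: t[1], reverse=True)
    return first + last


predicted = [(5, 1), (1, 5), (3, 3)]  # (compress, write) times, arbitrary units
print(pipeline_makespan(predicted))
print(pipeline_makespan(johnson_order(predicted)))
```

With these sample times the reordered schedule finishes earlier because the writer is kept busy while later tasks are still being compressed.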
Oblivious RAM protocols (ORAMs) allow a client to access data from an untrusted storage device without revealing the access patterns. Typically, the ORAM adversary can observe both read and write accesses. Write-only ORAMs target a more practical, {\em multi-snapshot adversary} only monitoring client writes -- typical for plausible deniability and censorship-resilient systems. This allows write-only ORAMs to achieve significantly better asymptotic performance. However, these apparent gains do not materialize in real deployments, primarily due to the random data placement strategies used to break correlations between logical and physical namespaces, a required property for write access privacy. Random access performs poorly on both rotational disks and SSDs (often increasing wear significantly, and interfering with wear-leveling mechanisms). In this work, we introduce SqORAM, a new locality-preserving write-only ORAM that preserves write access privacy without requiring random data access. Data blocks close to each other in the logical domain land in close proximity on the physical media. Importantly, SqORAM maintains this data locality property over time, significantly increasing re
Concurrency control protocols are the key to scaling the performance of current DBMSs. They efficiently interleave read and write operations in transactions, but occasionally they restrict concurrency by using coordination such as exclusive locking. Although exclusive locking ensures the correctness of the DBMS, it incurs serious performance penalties in multi-core environments. In particular, existing protocols generally suffer under emerging highly write-contended workloads, since they acquire many locks for write operations. In this paper, we rethink the Thomas write rule (TWR), which allows the timestamp ordering (T/O) protocol to omit write operations without any locking. We formalize the notion of omitting and decouple it from the T/O protocol implementation, in order to define a new rule named non-visible write rule (NWR). When the rules of NWR are satisfied, any protocol can in theory generate omittable write operations while preserving correctness without any locking. In the experiments, we implement three NWR-extended protocols: Silo+NWR, TicToc+NWR, and MVTO+NWR. Experimental results demonstrate the efficiency and the low-overhead property of the extended protocols. We c
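For reference, the classic Thomas write rule that this abstract starts from can be sketched as follows. This is the textbook rule for basic timestamp ordering, not the paper's NWR; the names `Item` and `timestamp_write` and the string return values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    value: object = None
    read_ts: int = 0    # largest timestamp of any transaction that read the item
    write_ts: int = 0   # timestamp of the transaction whose write is stored


def timestamp_write(item, ts, value):
    """Apply a write under timestamp ordering with the Thomas write rule."""
    if ts < item.read_ts:
        # A younger transaction already read the item: the write is too late.
        return "abort"
    if ts < item.write_ts:
        # Thomas write rule: a younger write is already in place, so this
        # obsolete write is skipped instead of aborting the transaction.
        return "omit"
    item.value = value
    item.write_ts = ts
    return "apply"


x = Item()
print(timestamp_write(x, 10, "a"))  # newest write so far
print(timestamp_write(x, 5, "b"))   # older than the stored write
```

The omitted write needs no lock and leaves the stored value untouched, which is the behavior the abstract proposes to generalize beyond the T/O protocol.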
Although Arabic is spoken by over 400 million people, advanced Arabic writing assistance tools remain limited. To address this gap, we present ARWI, a new writing assistant that helps learners improve essay writing in Modern Standard Arabic. ARWI is the first publicly available Arabic writing assistant to include a prompt database for different proficiency levels, an Arabic text editor, state-of-the-art grammatical error detection and correction, and automated essay scoring aligned with the Common European Framework of Reference standards for language attainment. Moreover, ARWI can be used to gather a growing auto-annotated corpus, facilitating further research on Arabic grammar correction and essay scoring, as well as profiling patterns of errors made by native speakers and non-native learners. A preliminary user study shows that ARWI provides actionable feedback, helping learners identify grammatical gaps, assess language proficiency, and guide improvement.
As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and fu
Crossbar architectures have long been seen as a promising foundation for in-memory computing, using memristor arrays for high-density, energy-efficient analog computation. However, this conventional architecture suffers from a fundamental limitation: the inability to perform parallel write operations due to the sneak path problem. This arises from the structural overlap of read and write paths, forcing sequential or semi-parallel updates and severely limiting scalability. To address this, we introduce a new memristor design that decouples read and write operations at the device level. This design enables orthogonal conductive paths, and employs a reversible ion doping mechanism, inspired by lithium-ion battery principles, to modulate resistance states independently of computation. Fabricated devices exhibit near-ideal memristive characteristics and stable performance under isolated read/write conditions.