Benchmarks for large language models (LLMs) have progressed from snippet-level function generation to repository-level issue resolution, yet they overwhelmingly target implementation correctness. Software architecture tasks remain under-specified and difficult to compare across models, despite their central role in maintaining and evolving complex systems. We present ArchBench, the first unified platform for benchmarking LLM capabilities on software architecture tasks. ArchBench provides a command-line tool with a standardized pipeline for dataset download, inference with trajectory logging, and automated evaluation, alongside a public web interface with an interactive leaderboard. The platform is built around a plugin architecture in which each task is a self-contained module, making it straightforward for the community to contribute new architectural tasks and evaluation results. We use the term LLMs broadly to encompass generative AI (GenAI) solutions for software engineering, including both standalone models and LLM-based coding agents equipped with tools. Both the CLI tool and the web platform are openly available to support reproducible research and community-driven growth of architecture benchmarks.
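As a rough sketch of what such a self-contained task plugin could look like, consider the following Python module; the class name `ArchTaskPlugin`, its method names, and the scoring logic are illustrative assumptions, not ArchBench's actual API.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    # One logged inference run, as the pipeline's trajectory logging implies.
    task_id: str
    prompt: str
    model_output: str

class ArchTaskPlugin:
    """One self-contained architectural task: dataset, inference, evaluation."""
    name = "example-architecture-recovery"  # hypothetical task name

    def load_dataset(self) -> list[dict]:
        # A real plugin would download and cache task instances here.
        return [{"id": "repo-001",
                 "prompt": "Recover the layered view of this repository: ...",
                 "answer": "presentation/domain/persistence"}]

    def run_inference(self, instance: dict, model) -> Trajectory:
        # Query the model (or tool-equipped agent) and log the trajectory.
        output = model.generate(instance["prompt"])
        return Trajectory(instance["id"], instance["prompt"], output)

    def evaluate(self, trajectory: Trajectory, instance: dict) -> float:
        # Automated scoring against a ground-truth architectural description.
        return float(trajectory.model_output.strip() == instance["answer"])
```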
Software architecture documentation is essential for system comprehension, yet it is often unavailable or incomplete. While recent LLM-based techniques can generate documentation from code, they typically address local artifacts rather than producing coherent, system-level architectural descriptions. This paper presents a structured process for automatically generating system-level architectural documentation directly from GitHub repositories using Large Language Models. The process, called CIAO (Code In Architecture Out), defines an LLM-based workflow that takes a repository as input and produces system-level architectural documentation following a template derived from ISO/IEC/IEEE 42010, SEI Views & Beyond, and the C4 model. The resulting documentation can be directly added to the target repository. We evaluated the process through a study with 22 developers, each reviewing the documentation generated for a repository they had contributed to. The evaluation shows that developers generally perceive the produced documentation as valuable, comprehensible, and broadly accurate with respect to the source code, while also highlighting limitations in diagram quality and high-level coverage.
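A hedged sketch of what such a repository-to-documentation workflow might look like; the template sections and prompting strategy below are assumptions inferred from the abstract, not CIAO's actual pipeline.

```python
from pathlib import Path

TEMPLATE_SECTIONS = [          # loosely derived from ISO/IEC/IEEE 42010, Views & Beyond, C4
    "Context",                 # C4 level 1: system context
    "Containers",              # C4 level 2: deployable units
    "Components",              # C4 level 3: major components per container
    "Architectural Decisions", # rationale, in the spirit of 42010
]

def summarize_repository(repo_path: Path) -> str:
    """Collect lightweight structural evidence (file tree) for the prompt."""
    files = sorted(p.relative_to(repo_path)
                   for p in repo_path.rglob("*") if p.is_file())
    return "\n".join(str(f) for f in files[:200])  # truncate to fit the context window

def generate_documentation(repo_path: Path, llm) -> str:
    evidence = summarize_repository(repo_path)
    doc = ["# Architectural Documentation\n"]
    for section in TEMPLATE_SECTIONS:
        prompt = (f"Given this repository structure:\n{evidence}\n\n"
                  f"Write the '{section}' section of a system-level "
                  f"architecture document.")
        doc.append(f"## {section}\n{llm.generate(prompt)}\n")
    return "\n".join(doc)  # ready to commit back into the target repository
```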
Current fault-tolerant quantum computer (FTQC) architectures utilize several encoding techniques to enable reliable logical operations with restricted qubit connectivity. However, such logical operations demand additional memory overhead to ensure fault tolerance. Since the main obstacle to practical quantum computing is the limited qubit count, our primary mission is to design floorplans that reduce memory overhead without compromising computational capability. Despite extensive efforts to explore FTQC architectures, even the current state-of-the-art floorplan strategy devotes 50% of memory space to this overhead, rather than to data storage, to ensure unit-time random access to all logical qubits. In this paper, we propose an FTQC architecture based on a novel floorplan strategy, the Load/Store Quantum Computer Architecture (LSQCA), which can achieve almost 100% memory density. The idea behind our architecture is to separate all memory regions into a small computational space called Computational Registers (CR) and a space-efficient memory space called Scan-Access Memory (SAM). We define an instruction set for these abstract structures and provide concrete designs named point-SAM and line-SAM.
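To make the CR/SAM split concrete, here is a toy rendering of the load/store abstraction in Python; the instruction names and data layout are illustrative assumptions, not the paper's actual ISA.

```python
class LSQCAMachine:
    """Toy model: gates run only on qubits resident in the small
    Computational Register (CR); Scan-Access Memory (SAM) stores the rest
    densely, which is what pushes memory density toward 100%."""

    def __init__(self, cr_slots: int):
        self.cr = [None] * cr_slots   # scarce computational space
        self.sam = {}                 # address -> logical qubit label

    def load(self, addr: int, slot: int):
        self.cr[slot] = self.sam.pop(addr)   # scan access: fetch into the CR

    def store(self, slot: int, addr: int):
        self.sam[addr] = self.cr[slot]       # return the qubit to dense storage
        self.cr[slot] = None

    def apply_gate(self, gate: str, *slots: int):
        assert all(self.cr[s] is not None for s in slots), \
            "operands must first be loaded into the CR"
        print(f"{gate} on CR slots {slots}")

m = LSQCAMachine(cr_slots=2)
m.sam = {0: "q0", 1: "q1"}
m.load(0, slot=0); m.load(1, slot=1)
m.apply_gate("CNOT", 0, 1)
m.store(0, addr=0); m.store(1, addr=1)
```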
The prevailing scaling paradigm of Large Language Models (LLMs) rests on a substrate of "Fuzzy" floating-point arithmetic. To mitigate the inherent instability of this approximate foundation, modern architectures have erected a complex scaffolding of structural and numerical heuristics--Complex Residuals, Pre-RMSNorm, Attention Scaling, and Gradient Clipping--consuming significant compute solely to prevent numerical collapse. We propose a paradigm shift to the "Exact". We introduce the Halo Architecture, grounded in the Rational Field (Q) and powered by a custom Exact Inference Unit (EIU). To resolve the exponential bit-width growth of rational arithmetic, Halo employs a Dual-Ring Topology that unifies two complementary control mechanisms: (1) the Micro-Ring (Continuum Maintenance), which strictly bounds memory complexity via Diophantine Approximation; and (2) the Macro-Ring (Symbolic Alignment), which enforces logical consistency via periodic state collapse. This stable dual-ring substrate allows for the "Great Dismantling" of numerical scaffolding, reducing the Transformer block to its "Clean" algebraic form (Tabula Rasa). Furthermore, we verify the "Efficiency Paradox": the elimination of the numerical scaffolding offsets the overhead of exact arithmetic.
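The Micro-Ring's bounded-memory idea maps naturally onto best rational approximation. A minimal sketch, assuming a simple additive state update, using Python's exact `fractions.Fraction` and its `limit_denominator` method, which computes precisely this kind of bounded-denominator Diophantine approximation:

```python
from fractions import Fraction

MAX_DEN = 1 << 16            # memory budget: denominators stay within 16 bits

def micro_ring_step(state: Fraction, update: Fraction) -> Fraction:
    state = state + update               # exact arithmetic over Q
    if state.denominator > MAX_DEN:      # bit-width grew past the budget
        state = state.limit_denominator(MAX_DEN)  # best rational approximation
    return state

x = Fraction(0)
for k in range(1, 1000):
    x = micro_ring_step(x, Fraction(1, k * k + 1))
assert x.denominator <= MAX_DEN          # memory complexity stays bounded
```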
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics.
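A minimal sketch of top-k expert routing, the mechanism behind the MoE computation-communication trade-off mentioned above; the shapes and softmax gating below are the standard formulation, not DeepSeek-V3's exact router.

```python
import numpy as np

def route_tokens(hidden: np.ndarray, gate_w: np.ndarray, k: int = 2):
    """hidden: (tokens, d_model); gate_w: (d_model, n_experts)."""
    logits = hidden @ gate_w                          # per-token expert scores
    topk = np.argsort(logits, axis=-1)[:, -k:]        # k experts per token
    scores = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # renormalized gate weights
    return topk, weights                              # experts to dispatch to, mixing weights

experts, weights = route_tokens(np.random.randn(4, 8), np.random.randn(8, 16))
# Only k of the 16 experts run per token, so compute stays sparse; the cost
# moves to the all-to-all dispatch/combine communication across devices.
```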
Edge caching is an emerging technology that equips edge nodes with caching units, allowing users to fetch contents of interest that have been pre-cached at the edge nodes. The key to pre-caching is to maximize the cache hit percentage for cached content without compromising users' privacy. In this letter, we propose a federated learning (FL) assisted edge caching scheme based on a denoising diffusion probabilistic model with a lightweight architecture (LDPM). Our simulation results verify that the proposed scheme achieves a higher cache hit percentage than existing FL-based methods and baseline methods.
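A minimal sketch of the federated round underlying such a scheme, assuming simple FedAvg aggregation and a linear stand-in for the LDPM popularity model; nothing here is the letter's actual algorithm.

```python
import numpy as np

def local_update(weights, features, labels, lr=0.01, epochs=5):
    # Each edge node fits the popularity model on its private request logs.
    w = weights.copy()
    for _ in range(epochs):
        grad = features.T @ (features @ w - labels) / len(labels)
        w -= lr * grad
    return w

def fed_avg(global_w, node_data):
    updates = [local_update(global_w, X, y) for X, y in node_data]
    return np.mean(updates, axis=0)   # server sees weights only, never raw requests

rng = np.random.default_rng(0)
nodes = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(3)]
w = np.zeros(4)
for _ in range(10):
    w = fed_avg(w, nodes)             # the learned model then drives pre-caching
```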
The first years of the 2000s marked an inflection point in computer architecture: while the number of available transistors on a chip continued to grow, crucial transistor scaling properties began to break down, driving up power consumption, and aggressive single-core performance optimizations yielded diminishing returns due to inherent limits in instruction-level parallelism. This led to the rise of multicore CPU architectures, which are now commonplace in modern computers at all scales. In this chapter, we discuss the evolution of multicore CPUs since their introduction. Starting with a historical overview of multiprocessing, we explore the basic microarchitecture of a multicore CPU, key challenges resulting from shared memory resources, operating system modifications to optimize multicore CPU support, popular metrics for multicore evaluation, and recent trends in multicore CPU design.
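A worked example of Amdahl's law, the classic model behind such speedup metrics (one plausible instance of the evaluation metrics the chapter refers to): speedup(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the workload.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup on n cores when a fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

for cores in (2, 4, 8, 64):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
# 2 -> 1.82, 4 -> 3.08, 8 -> 4.71, 64 -> 8.77: even 90%-parallel code
# saturates far below linear speedup as core counts grow.
```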
Several approaches have recently used automated optimization techniques to generate architecture design alternatives. These approaches aim to improve an initial architecture with respect to quality aspects such as performance, reliability, or maintainability. In this context, each optimization experiment usually produces a different set of architecture alternatives characterized by specific settings. As a consequence, the designer is left with the task of comparing such sets to identify the settings that lead to better solution sets for the problem. To assess the quality of solution sets, multi-objective optimization commonly relies on quality indicators. Among these, the maximum spread quality indicator estimates the diversity of the generated alternatives, providing a measure of how much of the solution space has been explored. However, the maximum spread indicator is computed only on the objective space and does not consider architectural information (e.g., component structure, design decisions) from the architectural space. In this paper, we propose a quality indicator for the spread that assesses the diversity of alternatives by taking such architectural information into account.
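For reference, the standard maximum-spread computation over the objective space, followed by a toy architectural-space variant; the Jaccard-based architectural distance below is an illustrative assumption, not the indicator the paper proposes.

```python
import itertools
import math

def maximum_spread(solutions: list[dict]) -> float:
    """Objective-space MS: the diagonal of the bounding box spanned by the
    solution set; each solution maps an objective name to its value."""
    objectives = solutions[0].keys()
    return math.sqrt(sum(
        (max(s[o] for s in solutions) - min(s[o] for s in solutions)) ** 2
        for o in objectives))

def architectural_spread(decisions: list[set]) -> float:
    """Toy architectural-space diversity: mean pairwise Jaccard distance
    between each alternative's set of design decisions."""
    pairs = list(itertools.combinations(decisions, 2))
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

alts = [{"perf": 0.9, "rel": 0.95}, {"perf": 0.7, "rel": 0.99}]
print(maximum_spread(alts))                      # objective space only
print(architectural_spread([{"cache", "lb"}, {"cache", "replica"}]))
```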
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to its high design overheads and the lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we take a deep dive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) enables the compilation of UPMEM-PIM source code into its compiled machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem critical for future PIM architectures to support.
The rise of the Internet of Things and Cyber-Physical Systems has introduced new challenges in ensuring secure and robust communication. The growing number of connected devices increases network complexity, leading to higher latency and traffic. Distributed computing architectures (DCAs) have gained prominence to address these issues. This shift has significantly expanded the attack surface, requiring additional security measures to protect all components -- from sensors and actuators to edge nodes and central servers. Recent incidents highlight the difficulty of this task: cyberattacks, such as distributed denial-of-service attacks, continue to pose severe threats and cause substantial damage. Implementing a holistic defense mechanism remains an open challenge, particularly against attacks that demand both enhanced resilience and rapid response. Addressing this gap requires innovative solutions to enhance the security of DCAs. In this work, we present our holistic self-adaptive security framework, which combines different adaptation strategies to create comprehensive and efficient defense mechanisms. We describe how to incorporate the framework into a real-world use case scenario.
DNA sequence alignment is an important workload in computational genomics. Reference-guided DNA assembly involves aligning many read sequences against candidate locations in a long reference genome. To reduce the computational load of this alignment, candidate locations can be pre-filtered using simpler alignment algorithms such as edit distance. Prior work has explored accelerating filtering on simulated compute-in-DRAM, due to the massive parallelism of compute-in-memory architectures. In this paper, we present work in progress on accelerating filtering using a commercial compute-in-SRAM accelerator. We leverage the recently released Gemini accelerator platform from GSI Technology, which is, to our knowledge, the first commercial-scale compute-in-SRAM system. We accelerate Myers' bit-parallel edit distance algorithm, producing average speedups of 14.1x over single-core CPU performance. Individual query/candidate alignments achieve speedups of up to 24.1x. These early results suggest this novel architecture is well suited to accelerating the filtering step of sequence-to-sequence DNA alignment.
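For context, Myers' algorithm encodes the vertical deltas of the dynamic-programming matrix in bit vectors and processes one text character per constant-time word update; a straightforward Python transcription follows (Python's big integers stand in for the accelerator's wide machine words).

```python
def myers_distance(pattern: str, text: str) -> int:
    """Bit-parallel Levenshtein distance (Myers, 1999)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1
    peq = {}                              # per-character match bitmasks
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m            # positive/negative vertical deltas
    top = 1 << (m - 1)                    # bit tracking the last pattern row
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq # carry propagation finds matches
        ph = mv | (~(xh | pv) & mask)     # positive horizontal deltas
        mh = pv & xh                      # negative horizontal deltas
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = ((ph << 1) | 1) & mask       # shift in the boundary condition
        mh = (mh << 1) & mask
        pv = (mh | (~(xv | ph) & mask)) & mask
        mv = ph & xv
    return score

assert myers_distance("abc", "abc") == 0
assert myers_distance("abc", "abd") == 1
```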
This report defines a carrier-grade split architecture based on requirements identified during the SPARC project. It presents the SplitArchitecture proposal, the SPARC concept for Software Defined Networking (SDN) introduced for large-scale wide area networks such as access/aggregation networks, and evaluates technical issues against architectural trade-offs. First, we present the control and management architecture of the proposed SplitArchitecture. Here, we discuss a recursive control architecture consisting of hierarchically stacked control planes and provide initial considerations regarding network management integration with SDN in general and SplitArchitecture in particular. Next, OpenFlow extensions to support the carrier-grade SplitArchitecture are discussed. These are: a) Openness & Extensibility; b) Virtualization; c) OAM; d) Resiliency approaches; e) Bootstrapping and topology discovery; f) Service creation; g) Energy-efficient networking; h) QoS aspects; and i) Multilayer aspects. In addition, we discuss selected deployment and adoption scenarios faced by modern operator networks, such as service creation scenarios and peering aspects, i.e., how to interconnect with legacy networks.
The effectiveness of shortcut/skip connections has been widely verified, inspiring extensive exploration of neural architecture design. This work attempts to find an effective way to design new network architectures. It is observed that the main differences between network architectures can be reflected in their recursion formulas. Based on this, a methodology is proposed to design novel network architectures from the perspective of mathematical formulas. A case study is then provided to generate an improved architecture based on ResNet. The new architecture is compared with ResNet and tested on ResNet-based networks. Extensive experiments on CIFAR and ImageNet demonstrate the significant performance improvements provided by the architecture.
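The recursion-formula view is easy to make concrete; a minimal sketch with scalar stand-ins for the blocks, contrasting the plain feed-forward recursion with ResNet's:

```python
def plain_net(x, blocks):
    for f in blocks:
        x = f(x)              # x_{k+1} = f_k(x_k)
    return x

def res_net(x, blocks):
    for f in blocks:
        x = x + f(x)          # x_{k+1} = x_k + f_k(x_k): the skip connection
    return x

# Toy scalar "blocks"; a new architecture amounts to a new recursion formula.
blocks = [lambda v: 0.1 * v for _ in range(3)]
print(plain_net(1.0, blocks))  # 0.001: the signal decays through composition
print(res_net(1.0, blocks))    # 1.331: the identity path preserves the signal
```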
Innovation and standardization in 5G have brought advancements to every facet of the cellular architecture. This ranges from the introduction of new frequency bands and signaling technologies for the radio access network (RAN), to a core network underpinned by micro-services and network function virtualization (NFV). However, like any emerging technology, the pace of real-world deployments does not instantly match the pace of innovation. To address this discrepancy, one of the key aspects under continuous development is the RAN, with the aim of making it more open, adaptive, functional, and easy to manage. In this paper, we highlight the transformative potential of embracing novel cellular architectures by transitioning from conventional systems to the progressive principles of Open RAN. This promises to make 6G networks more agile, cost-effective, energy-efficient, and resilient. It opens up a plethora of novel use cases, ranging from ubiquitous support for autonomous devices to cost-effective expansions in regions previously underserved. The principles of Open RAN encompass: (i) a disaggregated architecture with modular and standardized interfaces; (ii) cloudification, programmability, and orchestration; and (iii) AI-driven, data-centric closed-loop control and automation.
We present several enhancements to the open-source ESP platform to support flexible and efficient on-chip communication for programmable accelerators in heterogeneous SoCs. These enhancements include 1) a flexible point-to-point communication mechanism between accelerators, 2) a multicast NoC that supports data forwarding to multiple accelerators simultaneously, 3) accelerator synchronization leveraging the SoC's coherence protocol, 4) an accelerator interface that offers fine-grained control over the communication mode used, and 5) an example ISA extension to support our enhancements. Our solution adds negligible area to the SoC architecture and requires minimal changes to the accelerators themselves. We have validated most of these features in complex FPGA prototypes and plan to include them in the open-source release of ESP in the coming months.
The estimation and improvement of quality attributes in software architectures is a challenging and time-consuming activity. For modern software applications, a model-based representation is crucial to face the complexity of this activity. One main challenge is that improving distinct quality attributes may require contrasting refactoring actions on the architecture, for instance when looking for trade-offs between performance and reliability (or other non-functional quality attributes). In such cases, multi-objective optimization can provide the designer with a more complete view of these trade-offs and, consequently, can help identify suitable refactoring actions that take into account independent or even competing objectives. In this paper, we present open challenges and research directions to fill current gaps in the context of multi-objective software architecture optimization.
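The trade-off view can be made concrete with the usual Pareto-dominance check; a minimal sketch with made-up candidate architectures, where response time stands in for performance (lower is better) and reliability is a probability (higher is better):

```python
def dominates(a, b):
    """a, b: (response_time_ms, reliability). a dominates b if it is no
    worse in both objectives and strictly better in at least one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    better = a[0] < b[0] or a[1] > b[1]
    return no_worse and better

candidates = [(120, 0.99), (95, 0.97), (95, 0.99), (200, 0.999)]
pareto = [c for c in candidates
          if not any(dominates(d, c) for d in candidates if d != c)]
print(pareto)  # [(95, 0.99), (200, 0.999)]: the alternatives worth showing
               # the designer, each embodying a different trade-off
```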
Software architecture, from definition to maintenance and evolution, is a complex aspect of software development and, consequently, a challenging subject to teach and to learn. Many research efforts have been devoted to designing teaching approaches, strategies, and tools. Most of them, however, focus on the knowledge itself and the ways to convey it to students, rather than on the students' different learning styles. Teaching methods that predominantly rely on verbal and written communication align well with some learning styles, but students whose learning styles benefit more from physical activity or first-hand experience must fall back on cognitive processes that are less natural to them. In this work, we propose an innovative use of role-playing as a teaching strategy for reference architecture models (i.e., layered, pipe-and-filter, client-server, etc.). This role-playing of different software architectures, in which students play the part of specific components in the system, is intended to complement classical teaching materials such as in-person or recorded lectures, lab assignments, and development projects.
Neural architecture search (NAS) is a recent methodology for automating the design of neural network architectures. Differentiable neural architecture search (DARTS) is a promising NAS approach that dramatically increases search efficiency. However, it has been shown to suffer from performance collapse, where the search often leads to detrimental architectures. Many recent works try to address this issue by identifying indicators for early stopping, regularising the search objective to reduce the dominance of some operations, or changing the parameterisation of the search problem. In this work, we hypothesise that performance collapse can arise from poor local optima around typical initial architectures and weights. We address this issue by developing a more global optimisation scheme that is able to better explore the space without changing the DARTS problem formulation. Our experiments show that our changes to the search algorithm allow the discovery of architectures with both better test performance and fewer parameters.
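For context, the DARTS relaxation at the root of the collapse behaviour replaces each discrete operation choice with a softmax-weighted mixture; a scalar toy of that mixed operation (real DARTS applies it to feature maps with convolution and pooling candidates):

```python
import math

def mixed_op(x, alphas, ops):
    """DARTS edge: softmax over architecture weights alpha mixes the ops."""
    exps = [math.exp(a) for a in alphas]
    z = sum(exps)
    return sum((e / z) * op(x) for e, op in zip(exps, ops))

ops = [lambda x: x,            # skip connection: the op that tends to dominate
       lambda x: 0.5 * x,      # stand-in for a parametric op (e.g. a conv)
       lambda x: 0.0]          # the "none" op
print(mixed_op(1.0, [0.2, 0.1, -0.3], ops))
# Search descends on the alphas; collapse occurs when the skip/none weights
# grow until the derived discrete architecture is detrimental.
```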
Machine learning (ML) is a prevalent approach to taming the complexity of design space exploration for domain-specific architectures. Using ML for design space exploration poses challenges, however. First, it is not straightforward to identify the most suitable algorithm from an ever-increasing pool of ML methods. Second, assessing the trade-offs between performance and sample efficiency across these methods is inconclusive. Finally, the lack of a holistic framework for fair, reproducible, and objective comparison across these methods hinders the adoption of ML-aided architecture design space exploration and impedes the creation of repeatable artifacts. To mitigate these challenges, we introduce ArchGym, an open-source gym and easy-to-extend framework that connects diverse search algorithms to architecture simulators. To demonstrate its utility, we evaluate ArchGym across multiple vanilla and domain-specific search algorithms in designing a custom memory controller, deep neural network accelerators, and a custom SoC for AR/VR workloads, encompassing over 21K experiments. The results suggest that with unlimited samples, ML algorithms are equally favorable to meet a user-defined target specification if their hyperparameters are tuned; no single solution is necessarily better than another.
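A sketch of the kind of common interface that lets any search algorithm drive any simulator; the `reset`/`step` naming follows the familiar gym convention and the toy latency model is an assumption for illustration, not ArchGym's actual API.

```python
class MemoryControllerEnv:
    """Toy stand-in for a cycle-accurate memory-controller simulator,
    exposed behind a gym-style reset/step contract."""

    def __init__(self, target_latency: float = 50.0):
        self.target = target_latency          # user-defined target specification

    def reset(self) -> dict:
        return {"queue_depth": 8, "page_policy": "open"}   # default design point

    def step(self, params: dict):
        # Made-up analytic latency model standing in for real simulation.
        latency = 400 / params["queue_depth"] \
            + (0 if params["page_policy"] == "open" else 5)
        reward = -abs(latency - self.target)  # closer to the target spec is better
        return {"latency": latency}, reward

# Any search algorithm (random, RL, Bayesian, ...) can drive the same env:
env = MemoryControllerEnv()
candidates = [{"queue_depth": q, "page_policy": p}
              for q in range(1, 17) for p in ("open", "closed")]
best = max(candidates, key=lambda cfg: env.step(cfg)[1])
print(best, env.step(best)[0])
```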
Current architecture proposals within standards development organizations such as ETSI and 3GPP enable sensing capabilities in mobile networks; however, they do not include a repository for storing sensing data. Such a repository can be used for AI model training and to complement ongoing sensing service provisioning by improving efficiency and accuracy. One way of realizing this is through the fusion of historical sensing data with live sensing data. In this paper, we study historical and live sensing data fusion for Integrated Sensing and Communication in future 6G systems and introduce a Sensing Data Storage Function to store historical sensing data and sensing results. We show how the Sensing Data Storage Function can be used with other network functions in a 6G architecture proposal for Integrated Sensing and Communication. We validate our proposal with a measurement model and show performance improvements in terms of detection probability and false-alarm rate. The network functionality that fuses and processes sensing data combines live sensing measurements with historical sensing data using a map-aware hard filter that rejects detections consistent with known static objects.
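A minimal sketch of such a map-aware hard filter, assuming 2-D point detections and a simple Euclidean gate (both assumptions for illustration): live detections that fall within the gate distance of a known static map object are rejected, so only detections unexplained by the stored map survive for fusion.

```python
import math

def map_aware_filter(detections, map_objects, gate: float = 5.0):
    """detections, map_objects: lists of (x, y) positions in metres.
    Keep only detections not explained by any known static object."""
    def near_map(d):
        return any(math.dist(d, m) <= gate for m in map_objects)
    return [d for d in detections if not near_map(d)]

survivors = map_aware_filter([(10, 2), (40, 40)], [(11, 2), (0, 0)])
print(survivors)  # [(40, 40)]: only the unexplained detection is kept,
                  # lowering the false-alarm rate from known clutter
```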