共找到 20 条结果
We consider the problem of memory holes in slab allocators, where an item entered into memory occupies more memory than it actually requires due to a difference between the nearest larger slab class size and the size of the entered item. We solve this problem by using a greedy algorithm that analyses the pattern of the sizes of items previously entered into the memory and accordingly re-configuring the default slab classes to better suit the learned traffic pattern to minimize memory holes. Using this approach for a consistent data pattern, in our findings, has yielded significant reductions in memory wastage. We consider Memcached as it is one of the most widely used implementations of slab allocators today, and has native support to reconfigure its default slab classes.
High-Level Synthesis (HLS) aspires to raise the level of abstraction in hardware design without sacrificing hardware efficiency. It has so far been successfully employed in signal and video processing but has found only limited use in other areas. This paper utilizes a commercial HLS tool, namely Vivado(R) HLS, to implement the processing of a common data center application, the Key-Value Store (KVS) application memcached, as a deeply pipelined dataflow architecture. We compared our results to a fully equivalent RTL implementation done previously in our group and found that it matches its performance, yields tangible improvements in latency (between 7-30%) and resource consumption (22% in LUTs and 35% in registers), all while requiring 3x less lines of code and 2x less development time. The implementation was validated in hardware on a Xilinx(R) VC709 development board, meeting timing requirements for 10Gbps line rate processing.
We present the first comprehensive analysis of ARM MTE hardware performance on four different microarchitectures: ARM Big (A7x), Little (A5x), and Performance (Cortex-X) cores on the Google Pixel 8 and Pixel 9, and on Ampere Computing's AmpereOne CPU core. We also include preliminary analysis of MTE on Apple's M5 chip. We investigate performance in MTE's primary application -- probabilistic memory safety -- on both SPEC CPU benchmarks and in server workloads such as RocksDB, Nginx, PostgreSQL, and Memcached. While MTE often exhibits modest overheads, we also see performance slowdowns up to 6.64x on certain benchmarks. We identify the microarchitectural cause of these overheads and where they can be addressed in future processors. We then analyze MTE's performance for more specialized security applications such as memory tracing, time-of-check time-of-use prevention, sandboxing, and CFI. In some of these cases, MTE offers significant advantages today, while the benefits for other cases are negligible or will depend on future hardware. Finally, we explore where prior work characterizing MTE performance has either been incomplete or incorrect due to methodological or experimental erro
When a cache miss fetches from cloud object storage, the bill is per GET request and per byte of egress, not latency. Classic caching minimizes the miss rate, the wrong objective: a rarely but expensively fetched object can cost thousands of times more dollars than a frequently but cheaply fetched one. Generalized-caching theory bounds the miss-cost objective, but no reported benchmark measures how far deployed heuristics sit from the dollar-optimal offline policy on real cloud prices. We supply that reference. For uniform-size page caches with heterogeneous miss costs the offline dollar-optimum is exact in polynomial time via an integral interval linear program -- validated against brute force; variable sizes are NP-hard, so we extend the flow-based offline bound from the hit-ratio objective to dollars (cost-FOO), tight to about four percent. Against this reference we find: (i) a heterogeneity-regret law -- LRU's dollar-regret rises with miss-cost dispersion (Spearman 0.87) while cost-aware GreedyDual cuts it to roughly a tenth; (ii) a contention frontier -- GreedyDual's residual regret collapses to near zero exactly when the budget fits the expensive working set, and is the open
Advances in storage technology have introduced Non-Volatile Memory, NVM, as a new storage medium. NVM, along with Dynamic Random Access Memory (DRAM), Solid State Disk (SSD), and Disk present a system designer with a wide array of options in designing caching middleware. Moreover, design decisions to replicate a data item in more than one level of a caching memory hierarchy may enhance the overall system performance with a faster recovery time in the event of a memory failure. Given a fixed budget, the key configuration questions are: Which storage media should constitute the memory hierarchy? What is the storage capacity of each hierarchy? Should data be replicated or partitioned across the different levels of the hierarchy? We model these cache configuration questions as an instance of the Multiple Choice Knapsack Problem (MCKP). This model is guided by the specification of each type of memory along with an application's database characteristics and its workload. Although MCKP is NP-complete, its linear programming relaxation is efficiently solvable and can be used to closely approximate the optimal solution. We use the resulting simple algorithm to evaluate design tradeoffs in t
When compared to blocking concurrency, non-blocking concurrency can provide higher performance in parallel shared-memory contexts, especially in high contention scenarios. This paper proposes FLeeC, an application-level cache system based on Memcached, which leverages re-designed data structures and non-blocking (or lock-free) concurrency to improve performance by allowing any number of concurrent writes and reads to its main data structures, even in high-contention scenarios. We discuss and evaluate its new algorithms, which allow a lock-free eviction policy and lock-free fast lookups. FLeeC can be used as a plug-in replacement for the original Memcached, and its new algorithms and concurrency control strategies result in considerable performance improvements (up to 6x).
We present Trust<T>, a general, type- and memory-safe alternative to locking in concurrent programs. Instead of synchronizing multi-threaded access to an object of type T with a lock, the programmer may place the object in a Trust<T>. The object is then no longer directly accessible. Instead a designated thread, the object's trustee, is responsible for applying any requested operations to the object, as requested via the Trust<T> API. Locking is often said to offer a limited throughput per lock. Trust<T> is based on delegation, a message-passing technique which does not suffer this per-lock limitation. Instead, per-object throughput is limited by the capacity of the object's trustee, which is typically considerably higher. Our evaluation shows Trust<T> consistently and considerably outperforming locking where lock contention exists, with up to 22x higher throughput in microbenchmarks, and 5-9x for a home grown key-value store, as well as memcached, in situations with high lock contention. Moreover, Trust<T> is competitive with locks even in the absence of lock contention.
Cost Adaptive Multi-queue eviction Policy (CAMP) is an algorithm for a general purpose key-value store (KVS) that manages key-value pairs computed by applications with different access patterns, key-value sizes, and varying costs for each key-value pair. CAMP is an approximation of the Greedy Dual Size (GDS) algorithm that can be implemented as efficiently as LRU. In particular, CAMP's eviction policies are as effective as those of GDS but require only a small fraction of the updates to an internal data structure in order to make those decisions. Similar to an implementation of LRU using queues, it adapts to changing workload patterns based on the history of requests for different key-value pairs. It is superior to LRU because it considers both the size and cost of key-value pairs to maximize the utility of the available memory across competing applications. We compare CAMP with both LRU and an alternative that requires human intervention to partition memory into pools and assign grouping of key-value pairs to different pools. The results demonstrate CAMP is as fast as LRU while outperforming both LRU and the pooled alternative. We also present results from an implementation of CAM
Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to maintain all replicas coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page
"As many of us know from bitter experience, the policies provided in extant operating systems, which are claimed to work well and behave fairly 'on the average', often fail to do so in the special cases important to us" [Wulf et al. 1974]. Written in 1974, these words motivated moving policy decisions into user-space. Today, as warehouse-scale computers (WSCs) have become ubiquitous, it is time to move policy decisions away from individual servers altogether. Built-in policies are complex and often exhibit bad performance at scale. Meanwhile, the highly-controlled WSC setting presents opportunities to improve performance and predictability. We propose moving all policy decisions from the OS kernel to the cluster manager (CM), in a new paradigm we call Grape CM. In this design, the role of the kernel is reduced to monitoring, sending metrics to the CM, and executing policy decisions made by the CM. The CM uses metrics from all kernels across the WSC to make informed policy choices, sending commands back to each kernel in the cluster. We claim that Grape CM will improve performance, transparency, and simplicity. Our initial experiments show how the CM can identify the optimal set of
Many concurrent programs assign priorities to threads to improve responsiveness. When used in conjunction with synchronization mechanisms such as mutexes and condition variables, however, priorities can lead to priority inversions, in which high-priority threads are delayed by low-priority ones. Priority inversions in the use of mutexes are easily handled using dynamic techniques such as priority inheritance, but priority inversions in the use of condition variables are not well-studied and dynamic techniques are not suitable. In this work, we use a combination of static and dynamic techniques to prevent priority inversion in code that uses mutexes and condition variables. A type system ensures that condition variables are used safely, even while dynamic techniques change thread priorities at runtime to eliminate priority inversions in the use of mutexes. We prove the soundness of our system, using a model of priority inversions based on cost models for parallel programs. To show that the type system is practical to implement, we encode it within the type systems of Rust and C++, and show that the restrictions are not overly burdensome by writing sizeable case studies using these e
In this paper we introduce Timeloops a novel technique for automatically learning system call filtering policies for containerized microservices applications. At run-time, Timeloops automatically learns which system calls a program should be allowed to invoke while rejecting attempts to call spurious system calls. Further, Timeloops addresses many of the shortcomings of state-of-the-art static analysis-based techniques, such as the ability to generate tight filters for programs written in interpreted languages such as PHP, Python, and JavaScript. Timeloops has a simple and robust implementation because it is mainly built out of commodity, and proven, technologies such as seccomp-BPF, systemd, and Podman containers, with fewer than 500 lines of code. We demonstrate the utility of Timeloops by learning system calls for individual services and two microservices benchmark applications, which utilize popular technologies like Python Flask, Nginx (with PHP and Lua modules), Apache Thrift, Memcached, Redis, and MongoDB. Further, the amortized performance of Timeloops is similar to that of an unhardened system while producing a smaller system call filter than state-of-the-art static analys
We propose uBFT, the first State-Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only $2f{+}1$ replicas to tolerate $f$ Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pure crashes appear to be a mere illusion with real-life systems reportedly failing in many unexpected ways. uBFT relies on a small non-tailored trusted computing base -- disaggregated memory -- and consumes a practically bounded amount of memory (both local and disaggregated). uBFT is based on a novel abstraction called Consistent Tail Broadcast, which we use to prevent equivocation while bounding memory. We implement uBFT using RDMA-based disaggregated memory and obtain an end-to-end latency of as little as 10us. This is at least 50$\times$ faster than MinBFT , a state of the art $2f{+}1$ BFT SMR based on Intel's SGX. We use uBFT to replicate two key-value stores (Memcached and Redis), as well as a financial order matching engine (Liquibook). These applications have low latency (up to 20us) and become Byzantine tolerant with as little as 10us more. The price for uBFT is a small amount of reliable disaggregated memory (less t
User-facing applications running in modern datacenters exhibit irregular request patterns and are implemented using a multitude of services with tight latency requirements. These characteristics render ineffective existing energy conserving techniques when processors are idle due to the long transition time from a deep idle power state (C-state). While prior works propose management techniques to mitigate this inefficiency, we tackle it at its root with AgileWatts (AW): a new deep C-state architecture optimized for datacenter server processors targeting latency-sensitive applications. AW is based on three key ideas. First, AW eliminates the latency overhead of saving/restoring the core context (i.e., micro-architectural state) when powering-off/-on the core in a deep idle power state by i) implementing medium-grained power-gates, carefully distributed across the CPU core, and ii) retaining context in the power-ungated domain. Second, AW eliminates the flush latency overhead (several tens of microseconds) of the L1/L2 caches when entering a deep idle power state by keeping L1/L2 cache content power-ungated. A minimal control logic also remains power-ungated to serve cache coherence
Systems that require high-throughput and fault tolerance, such as key-value stores and databases, are looking to persistent memory to combine the performance of in-memory systems with the data-consistent fault-tolerance of nonvolatile stores. Persistent memory devices provide fast bytea-ddressable access to non-volatile memory. We analyze the design space when integrating persistent memory into in-memory key value stores and quantify performance tradeoffs between throughput, latency, and and recovery time. Previous works have explored many design choices, but did not quantify the tradeoffs. We implement persistent memory support in Redis and Memcached, adapting the data structures of each to work in two modes: (1) with all data in persistent memory and (2) a hybrid mode that uses persistent memory for key/value data and non-volatile memory for indexing and metadata. Our experience reveals three actionable design principles that hold in Redis and Memcached, despite their very different implementations. We conclude that the hybrid design increases throughput and decreases latency at a minor cost in recovery time and code complexity
Distributed caching systems (e.g., Memcached) are widely used by service providers to satisfy accesses by millions of concurrent clients. Given their large-scale, modern distributed systems rely on a middleware layer to manage caching nodes, to make applications easier to develop, and to apply load balancing and replication strategies. In this work, we performed a dependability evaluation of three popular middleware platforms, namely Twemproxy by Twitter, Mcrouter by Facebook, and Dynomite by Netflix, to assess availability and performance under faults, including failures of Memcached nodes and congestion due to unbalanced workloads and network link bandwidth bottlenecks. We point out the different availability and performance trade-offs achieved by the three platforms, and scenarios in which few faulty components cause cascading failures of the whole distributed system.
Applications running in Trusted Execution Environments (TEEs) commonly use untrusted external services such as host File System. Adversaries may maliciously alter the normal service behavior to trigger subtle application bugs that would have never occurred under correct service operation, causing data leaks and integrity violations. Unfortunately, existing manual protections are incomplete and ad-hoc, whereas formally-verified ones require special expertise. We introduce GateKeeper, a framework to develop mitigations and vulnerability checkers for such attacks by leveraging lightweight formal models of untrusted services. With the attack seen as a violation of a services' functional correctness, GateKeeper takes a novel approach to develop a comprehensive model of a service without requiring formal methods expertise. We harness available testing suites routinely used in service development to tighten the model to known correct service implementation. GateKeeper uses the resulting model to automatically generate (1) a correct-by-construction runtime service validator in C that is linked with a trusted application and guards each service invocation to conform to the model; and (2) a
Distributed Reflective Denial of Service (DRDoS) attacks are an immanent threat to Internet services. The potential scale of such attacks became apparent in March 2018 when a memcached-based attack peaked at 1.7 Tbps. Novel services built upon UDP increase the need for automated mitigation mechanisms that react to attacks without prior knowledge of the actual application protocols used. With the flexibility that software-defined networks offer, we developed a new approach for defending against DRDoS attacks; it not only protects against arbitrary DRDoS attacks but is also transparent for the attack target and can be used without assistance of the target host operator. The approach provides a robust mitigation system which is protocol-agnostic and effective in the defense against DRDoS attacks.
It is becoming increasingly popular for distributed systems to exploit offload to reduce load on the CPU. Remote Direct Memory Access (RDMA) offload, in particular, has become popular. However, RDMA still requires CPU intervention for complex offloads that go beyond simple remote memory access. As such, the offload potential is limited and RDMA-based systems usually have to work around such limitations. We present RedN, a principled, practical approach to implementing complex RDMA offloads, without requiring any hardware modifications. Using self-modifying RDMA chains, we lift the existing RDMA verbs interface to a Turing complete set of programming abstractions. We explore what is possible in terms of offload complexity and performance with a commodity RDMA NIC. We show how to integrate these RDMA chains into applications, such as the Memcached key-value store, allowing us to offload complex tasks such as key lookups. RedN can reduce the latency of key-value get operations by up to 2.6x compared to state-of-the-art KV designs that use one-sided RDMA primitives (e.g., FaRM-KV), as well as traditional RPC-over-RDMA approaches. Moreover, compared to these baselines, RedN provides per