Found 20 results
Implementations of Broadcast based on some information dissemination algorithm -- e.g., gossip or tree-based communication -- followed by a correction algorithm have been proposed previously. This work describes an approach to apply a similar idea to Reduce. In it, a correction-like communication phase precedes a tree-based phase. This provides a Reduce algorithm which is tolerant to a number of failed processes. Semantics of the resulting algorithm are provided and proven. Based on these results, Broadcast and Reduce are combined to provide Allreduce.
In the All-Reduce problem, each one of the K nodes holds an input and wishes to compute the sum of all K inputs through a communication network where each pair of nodes is connected by a parallel link with arbitrary bandwidth. The computation rate of All-Reduce is defined as the number of sum instances that can be computed over each network use. For the computation rate, we provide a cut-set upper bound and a linear programming lower bound based on time (bandwidth) sharing over all schemes that first perform Reduce (aggregating all inputs at one node) and then perform Broadcast (sending the sum from that node to all other nodes). Specializing the two general bounds gives us the optimal computation rate for a class of communication networks and the best-known rate bounds (where the upper bound is no more than twice the lower bound) for cyclic, complete, and hypercube networks.
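The Reduce-then-Broadcast structure described in this abstract can be illustrated with a minimal single-process sketch. It ignores link bandwidths and computation rates entirely, and the function names (`reduce_to_root`, `broadcast_from_root`, `all_reduce`) are hypothetical, not taken from the paper.

```python
def reduce_to_root(inputs, root=0):
    """Reduce step: the root node accumulates the inputs of all other nodes."""
    total = inputs[root]
    for node, value in enumerate(inputs):
        if node != root:
            total += value  # models one message from `node` to `root`
    return total


def broadcast_from_root(value, num_nodes):
    """Broadcast step: the root sends the sum to every node."""
    return [value for _ in range(num_nodes)]


def all_reduce(inputs, root=0):
    """All-Reduce built as Reduce followed by Broadcast."""
    total = reduce_to_root(inputs, root)
    return broadcast_from_root(total, len(inputs))


if __name__ == "__main__":
    print(all_reduce([3, 1, 4, 1, 5]))  # every node ends with the sum, 14
```

The choice of root does not change the result here; in the setting of the abstract it would matter because links have different bandwidths.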
Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27x. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit fro
We present a version of the REDUCE computer algebra system as it was in the early 1970s. We show how this historical version of REDUCE may be built and run in very modest present-day environments and outline some of its capabilities.
For nonconvex objective functions, including those found in training deep neural networks, stochastic gradient descent (SGD) with momentum is said to converge faster and have better generalizability than SGD without momentum. In particular, adding momentum is thought to reduce stochastic noise. To verify this, we estimated the magnitude of gradient noise by using convergence analysis and an optimal batch size estimation formula and found that momentum does not reduce gradient noise. We also analyzed the effect of search direction noise, which is stochastic noise defined as the error between the search direction of the optimizer and the steepest descent direction, and found that it inherently smooths the objective function and that momentum does not reduce search direction noise either. Finally, an analysis of the degree of smoothing introduced by search direction noise revealed that adding momentum offers limited advantage to SGD.
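For reference, the momentum update discussed in this abstract can be sketched as follows. This is a minimal illustration assuming the heavy-ball form of momentum (the abstract does not say which variant is analyzed), a scalar parameter, and Gaussian gradient noise; `sgd_momentum` and its parameters are hypothetical names.

```python
import random


def sgd_momentum(grad, x0, lr=0.1, beta=0.9, steps=300, noise_std=0.0, seed=0):
    """Heavy-ball SGD on a scalar parameter with optional Gaussian gradient noise."""
    rng = random.Random(seed)
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x) + rng.gauss(0.0, noise_std)  # noisy gradient estimate
        v = beta * v + g                         # momentum buffer
        x = x - lr * v                           # parameter step
    return x


if __name__ == "__main__":
    # Minimise f(x) = x^2, whose gradient is 2x; beta = 0 recovers plain SGD.
    print(sgd_momentum(lambda x: 2.0 * x, x0=5.0))
    print(sgd_momentum(lambda x: 2.0 * x, x0=5.0, beta=0.0))
```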
We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop in both data-parallel and streaming applications, or a combination of both. The pattern makes it possible to deploy a single stencil computation kernel on different GPUs. We discuss the implementation of Loop-of-stencil-reduce in FastFlow, a framework for the implementation of applications based on the parallel patterns. Experiments are presented to illustrate the use of Loop-of-stencil-reduce in developing data-parallel kernels running on heterogeneous systems.
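The shape of the pattern named in this abstract can be sketched in plain sequential Python. This is only an illustration of the control structure (a loop whose body applies a stencil and then a reduction that decides termination); it is not FastFlow's API, and all names below are hypothetical.

```python
def loop_of_stencil_reduce(grid, stencil, reduce_fn, converged, max_iters=10_000):
    """Repeat (stencil, reduce) until the reduced value satisfies `converged`."""
    iterations = 0
    while iterations < max_iters:
        # Stencil: each new cell depends on a neighbourhood of the old grid.
        new_grid = [stencil(grid, i) for i in range(len(grid))]
        # Reduce: collapse old and new grids into one summary value.
        summary = reduce_fn(grid, new_grid)
        grid = new_grid
        iterations += 1
        if converged(summary):
            break
    return grid, iterations


def jacobi_stencil(grid, i):
    """Average of the two neighbours; boundary cells stay fixed."""
    if i == 0 or i == len(grid) - 1:
        return grid[i]
    return 0.5 * (grid[i - 1] + grid[i + 1])


def max_abs_change(old, new):
    """Largest per-cell change between two successive grids."""
    return max(abs(a - b) for a, b in zip(old, new))


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0, 0.0, 100.0]
    final, iters = loop_of_stencil_reduce(
        start, jacobi_stencil, max_abs_change, lambda delta: delta < 1e-9
    )
    print([round(v, 3) for v in final], iters)
```

In a data-parallel implementation the list comprehension and the `max` would each run as parallel kernels; the sequential version only shows how the three pieces compose.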
Gradient Smoothing is an efficient approach to reducing noise in gradient-based model explanation methods. SmoothGrad adds Gaussian noise to mitigate much of this noise. However, the crucial hyper-parameter in this method, the variance $σ$ of the Gaussian noise, is set manually or with a heuristic approach, so the smoothed gradients still contain a certain amount of noise. In this paper, we aim to interpret SmoothGrad as a corollary of convolution, thereby re-understanding the gradient noise and the role of $σ$ from the perspective of confidence level. Furthermore, we propose an adaptive gradient smoothing method, AdaptGrad, based on these insights. Through comprehensive experiments, both qualitative and quantitative results demonstrate that AdaptGrad could effectively reduce almost all the noise in vanilla gradients compared with baseline methods. AdaptGrad is simple and universal, making it applicable for enhancing gradient-based interpretability methods for better visualization.
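The SmoothGrad baseline mentioned in this abstract averages gradients taken at Gaussian-perturbed copies of the input. The sketch below shows only that baseline, not AdaptGrad. It assumes a gradient function over a plain list of floats; `smooth_grad` and `grad_sum_of_squares` are hypothetical names, and `sigma` here is the standard deviation passed to `random.gauss`.

```python
import random


def smooth_grad(grad_fn, x, sigma, n_samples=50, seed=0):
    """Average grad_fn over n_samples copies of x perturbed by N(0, sigma^2) noise."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = grad_fn(noisy)
        total = [t + g for t, g in zip(total, grad)]
    return [t / n_samples for t in total]


def grad_sum_of_squares(x):
    """Gradient of f(x) = sum(x_i^2), used as a toy stand-in for a model gradient."""
    return [2.0 * xi for xi in x]


if __name__ == "__main__":
    print(smooth_grad(grad_sum_of_squares, [3.0, -1.0], sigma=0.1, n_samples=2000))
```

With `sigma=0.0` the function reduces to the plain (vanilla) gradient, which makes the role of the noise scale easy to check.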
The existing buildings and building construction sectors together are responsible for over one-third of the total global energy consumption and nearly 40% of total greenhouse gas (GHG) emissions. GHG emissions from the building sector are made up of embodied emissions and operational emissions. Recognizing the importance of reducing energy use and emissions associated with the building sector, governments have introduced policies, standards, and design guidelines to improve building energy performance and reduce GHG emissions associated with operating buildings. However, policy initiatives that reduce embodied emissions of the existing building sector are lacking. This research aims to develop policy strategies to reduce embodied carbon emissions in retrofits. In order to achieve this goal, this research conducted a literature review and identification of policies and financial incentives in British Columbia (BC) for reducing overall GHG emissions from the existing building sector. Then, this research analyzed worldwide policies and incentives that reduce embodied carbon emissions in the existing building sector. After reviewing the two categories of retrofit policies, the author i
In physics, entanglement 'reduces' the entropy of an entity, because the (von Neumann) entropy of, e.g., a composite bipartite entity in a pure entangled state is systematically lower than the entropy of the component sub-entities. We show here that this 'genuinely non-classical reduction of entropy as a result of composition' also holds whenever two concepts combine in human cognition and, more generally, it is valid in human culture. We exploit these results and make a 'new hypothesis' on the nature of entanglement, namely, the production of entanglement in the preparation of a composite entity can be seen as a 'dynamical process of collaboration between its sub-entities to reduce uncertainty', because the composite entity is in a pure state while its sub-entities are in a non-pure, or density, state, as a result of the preparation. We identify within the nature of this entanglement a mechanism of contextual updating and illustrate the mechanism in the example we analyze. Our hypothesis naturally explains the 'non-classical nature' of some quantum logical connectives, as due to Bell-type correlations.
Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining leads to huge overheads, specifically when used for fine-tuning large DNNs designed for solving complex problems. Moreover, as each fabricated chip can have a distinct fault pattern, fault-aware retraining is required to be performed for each chip individually considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that, at first, computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on the resilience, it computes the amount of retraining required for each chip considering its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array.
Impacts may have had a significant effect on the atmospheric chemistry of the early Earth. Reduced phases in the impactor (e.g., metallic iron) can reduce the planet's H$_2$O inventory to produce massive atmospheres rich in H$_2$. Whilst previous studies have focused on the interactions between the impactor and atmosphere in such scenarios, we investigate two further effects, 1) the distribution of the impactor's iron inventory during impact between the target interior, target atmosphere, and escaping the target, and 2) interactions between the post-impact atmosphere and the impact-generated melt phase. We find that these two effects can potentially counterbalance each other, with the melt-atmosphere interactions acting to restore reducing power to the atmosphere that was initially accreted by the melt phase. For a $\sim10^{22}\,\mathrm{kg}$ impactor, when the iron accreted by the melt phase is fully available to reduce this melt, we find an equilibrium atmosphere with H$_2$ column density $\sim10^4\,\mathrm{moles\,cm^{-2}}$ ($p\mathrm{H2}\sim120\,\mathrm{bars}\mathrm{,}~X_\mathrm{H2}\sim0.77$), consistent with previous estimates. However, when the iron is not available to reduce t
Reduced basis methods build low-rank approximation spaces for the solution sets of parameterized PDEs by computing solutions of the given PDE for appropriately selected snapshot parameters. Localized reduced basis methods reduce the offline cost of computing these snapshot solutions by instead constructing a global space from spatially localized less expensive problems. In the case of online enrichment, these local problems are iteratively solved in regions of high residual and correspond to subdomain solves in domain decomposition methods. We show in this note that indeed there is a close relationship between online-enriched localized reduced basis and domain decomposition methods by introducing a Localized Reduced Basis Additive Schwarz method (LRBAS), which can be interpreted as a locally adaptive multi-preconditioning scheme for the CG method.
In this paper, we analyze a hybridized discontinuous Galerkin (HDG) method with reduced stabilization for the Stokes equations. The reduced stabilization enables us to reduce the number of facet unknowns and improve the computational efficiency of the method. We provide optimal error estimates in the energy and $L^2$ norms. It is shown that the reduced method with the lowest-order approximation is closely related to the nonconforming Crouzeix-Raviart finite element method. We also prove that the solution of the reduced method converges to the nonconforming Gauss-Legendre finite element solution as a stabilization parameter $τ$ tends to infinity and that the convergence rate is $O(τ^{-1})$.
For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks but also that reducing the precision of activations hurts model accuracy much more than reducing the precision of model parameters. We study schemes to train networks from scratch using reduced-precision activations without hurting the model accuracy. We reduce the precision of activation maps (along with model parameters) using a novel quantization scheme and increase the number of filter maps in a layer, and find that this scheme compensates or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly reduce the dynamic memory footprint, memory bandwidth, computational energy and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN - wide reduced-precision networks. We report results using our proposed schemes and show that our results are better than previously reported accuracies on ILSVRC-12 dataset while being computationally less expensive compared to previously reported reduced-precision networks.
The rapid increase of smoking-related diseases and deaths globally is driving us to find an effective approach to reduce the smoking rate. This study aims to determine whether indoor smoking bans at workplaces can effectively reduce the smoking rate. The Smokeban dataset used for this study is an observational dataset that contains some socio-demographic factors, whether people smoke, and whether smoking bans exist. Since the observational data used in the study did not randomize people into a with-smoking-bans group and a without-smoking-bans group, confounders may cause bias in the estimation of whether the smoking bans can reduce smoking rates. The propensity score matching (PSM) method can reduce these biases by using a logistic regression model to predict the similarities of people in those two groups and using the nearest neighbour matching technique to match people who are the most similar. After reducing the bias, another regression model was created to interpret the relationship between the probability of smoking and the indoor smoking bans. We conclude by arguing that with the existence of indoor smoking bans, the probability that people smoke can be decreased greatly.
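The nearest-neighbour matching step described in this abstract can be sketched as below. The sketch assumes propensity scores have already been estimated (for example by a logistic regression, which is not shown) and matches with replacement; the function name and the example scores are hypothetical and unrelated to the Smokeban data.

```python
def match_nearest(treated_scores, control_scores):
    """Pair each treated unit with the control unit whose propensity score is closest.

    Matching is done with replacement, so one control unit may be reused.
    Returns a list of (treated_index, control_index) pairs.
    """
    pairs = []
    for t_idx, t_score in enumerate(treated_scores):
        c_idx = min(
            range(len(control_scores)),
            key=lambda j: abs(control_scores[j] - t_score),
        )
        pairs.append((t_idx, c_idx))
    return pairs


if __name__ == "__main__":
    treated = [0.8, 0.3]         # hypothetical scores, units under a smoking ban
    control = [0.25, 0.5, 0.75]  # hypothetical scores, units without a ban
    print(match_nearest(treated, control))
```

Matching without replacement, or with a caliper on the score distance, are common alternatives; the abstract does not say which was used.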
The isospectral reduction of a matrix, which is closely related to its Schur complement, allows one to reduce the size of a matrix while maintaining its eigenvalues up to a known set. Here we generalize this procedure by increasing the number of possible ways a matrix can be isospectrally reduced. The reduced matrix has rational functions as entries. We show that the notion of pseudospectrum can be extended to this class of matrices and that the pseudospectrum of a matrix shrinks as the matrix is reduced. Hence the eigenvalues of a reduced matrix are more robust to entry-wise perturbations than the eigenvalues of the original matrix. We also introduce the notion of inverse pseudospectrum (or pseudoresonances), which indicates how stable the poles of a matrix with rational function entries are to certain matrix perturbations. A mass spring system is used to illustrate and give a physical interpretation to both pseudospectra and inverse pseudospectra.
In Bayesian inverse problems sampling the posterior distribution is often a challenging task when the underlying models are computationally intensive. To this end, surrogates or reduced models are often used to accelerate the computation. However, in many practical problems, the parameter of interest can be of high dimensionality, which renders standard model reduction techniques infeasible. In this paper, we present an approach that employs the ANOVA decomposition method to reduce the model with respect to the unknown parameters, and the reduced basis method to reduce the model with respect to the physical parameters. Moreover, we provide an adaptive scheme within the MCMC iterations, to perform the ANOVA decomposition with respect to the posterior distribution. With numerical examples, we demonstrate that the proposed model reduction method can significantly reduce the computational cost of Bayesian inverse problems, without sacrificing much accuracy.
We study optimal control of diffusions with slow and fast variables and address a question raised by practitioners: is it possible to first eliminate the fast variables before solving the optimal control problem and then use the optimal control computed from the reduced-order model to control the original, high-dimensional system? The strategy "first reduce, then optimize"--rather than "first optimize, then reduce"--is motivated by the fact that solving optimal control problems for high-dimensional multiscale systems is numerically challenging and often computationally prohibitive. We state sufficient and necessary conditions under which the "first reduce, then control" strategy can be employed and discuss when it should be avoided. We further give numerical examples that illustrate the "first reduce, then optimize" approach and discuss possible pitfalls.
This work presents a method to adaptively refine reduced-order models \emph{a posteriori} without requiring additional full-order-model solves. The technique is analogous to mesh-adaptive $h$-refinement: it enriches the reduced-basis space online by `splitting' a given basis vector into several vectors with disjoint support. The splitting scheme is defined by a tree structure constructed offline via recursive $k$-means clustering of the state variables using snapshot data. The method identifies the vectors to split online using a dual-weighted-residual approach that aims to reduce error in an output quantity of interest. The resulting method generates a hierarchy of subspaces online without requiring large-scale operations or full-order-model solves. Further, it enables the reduced-order model to satisfy \emph{any prescribed error tolerance} regardless of its original fidelity, as a completely refined reduced-order model is mathematically equivalent to the original full-order model. Experiments on a parameterized inviscid Burgers equation highlight the ability of the method to capture phenomena (e.g., moving shocks) not contained in the span of the original reduced basis.
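The "splitting" operation described in this abstract, where one basis vector is replaced by several vectors with disjoint support, can be illustrated with a short sketch. It assumes the index clusters are already given (in the abstract they come from an offline tree built by recursive $k$-means, which is not reproduced here); `split_basis_vector` is a hypothetical name.

```python
def split_basis_vector(vector, clusters):
    """Split `vector` into one child per cluster of state indices.

    Each child keeps the parent's entries on its own cluster and is zero
    elsewhere, so the children have disjoint support and sum to the parent
    when the clusters partition the index set.
    """
    children = []
    for indices in clusters:
        child = [0.0] * len(vector)
        for i in indices:
            child[i] = vector[i]
        children.append(child)
    return children


if __name__ == "__main__":
    parent = [1.0, 2.0, 3.0, 4.0]
    clusters = [[0, 2], [1, 3]]  # assumed partition of the state indices
    for child in split_basis_vector(parent, clusters):
        print(child)
```

Because the parent lies in the span of its children, the enriched space contains the original one, which is consistent with the abstract's statement that a completely refined model is equivalent to the full-order model.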
The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique used to reduce energy consumption, and thus improve energy-efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of training and inference of Large Language Models (LLMs) through DVFS or power capping is feasible: related work has shown energy savings can be significant, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations: i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level, DVFS approach that explores new frequency configurations, and prove these save more energy than previous, pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach could reduce energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effe