Multi-step-ahead forecasts are often updated as new observations become available, since shorter forecast horizons typically improve forecast quality. However, such improvements come at the cost of forecast instability, i.e., variability in forecasts for the same target period. This instability can trigger costly changes to plans formulated based on the forecasts and may erode trust in the forecasting system. In this work, we integrate forecast stability alongside forecast quality into the training of distribution-free probabilistic time-series forecasting models, allowing us to control this trade-off. We propose a method for generating stabilized forecasted conditional quantile functions using regression splines parameterized by a neural network. This approach enables joint optimization of quality and stability, as it allows us to directly penalize dissimilarities arising from forecast updates. Furthermore, it allows assigning varying importance to stabilizing different parts of the forecast distributions (e.g., central parts vs. tails) to focus on the parts most relevant for the intended downstream use (e.g., the upper tail for inventory management). We empirically evaluate the p
This paper proposes corrected forecast combinations when the original combined forecast errors are serially dependent. Motivated by the classic Bates and Granger (1969) example, we show that combined forecast errors can be strongly autocorrelated and that a simple correction--adding a fraction of the previous combined error to the next-period combined forecast--can deliver sizable improvements in forecast accuracy, often exceeding the original gains from combining. We formalize the approach within the conditional risk framework of Gibbs and Vasnev (2024), in which the combined error decomposes into a predictable component (measurable at the forecast origin) and an innovation. We then link this correction to efficient estimation of combination weights under time-series dependence via GLS, allowing joint estimation of weights and an error-covariance structure. Using the U.S. Survey of Professional Forecasters for major macroeconomic indices across various subsamples (including pre and post-2000, GFC, and COVID), we find that a parsimonious correction of the mean forecast with a coefficient around 0.5 is a robust starting point and often yields material improvements in forecast accura
The problem of combining multiple forecasts of related quantities that obey expected equality and additivity constraints, often referred to as hierarchical forecast reconciliation, is naturally stated as a simple optimization problem. In this paper we explore optimization-based point forecast reconciliation at scales faced by large retailers. We implement and benchmark several algorithms to solve the forecast reconciliation problem, showing efficacy when the dimension of the problem exceeds four billion forecasted values. To the best of our knowledge, this is the largest forecast reconciliation problem, and perhaps on par with the largest constrained least-squares problem ever solved. We also make several theoretical contributions. We show that for a restricted class of problems and when the loss function is weighted appropriately, least-squares forecast reconciliation is equivalent to share-based forecast reconciliation. This formalizes how the optimization-based approach can be thought of as a generalization of share-based reconciliation, applicable to multiple, overlapping data hierarchies.
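To make the optimization view concrete, the following is a minimal sketch of unweighted least-squares reconciliation on a toy three-series hierarchy. The summing matrix `S`, the function name `reconcile_ols`, and the example numbers are illustrative assumptions, not the algorithms or data benchmarked in the paper, which targets far larger problems.

```python
import numpy as np

def reconcile_ols(base_forecasts, S):
    """Return the coherent forecasts closest (in least squares) to the base forecasts.

    S is an assumed summing matrix mapping bottom-level values to all series.
    """
    # Solve min_b ||S b - base_forecasts||^2 for the bottom-level values b.
    b, *_ = np.linalg.lstsq(S, base_forecasts, rcond=None)
    # Map the bottom-level solution back to every level of the hierarchy.
    return S @ b

# Toy hierarchy (assumption): total = A + B, ordered as [total, A, B].
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Incoherent base forecasts: 45 + 50 != 100.
base = np.array([100.0, 45.0, 50.0])
reconciled = reconcile_ols(base, S)
```

After reconciliation the total equals the sum of its children by construction. A weighted variant would rescale the rows before solving, which is where the choice of loss weights mentioned in the abstract would enter.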
This paper proposes using two metrics to quantify the forecastability of time series prior to model development: the spectral predictability score and the largest Lyapunov exponent. Unlike traditional model evaluation metrics, these measures assess the inherent forecastability characteristics of the data before any forecast attempts. The spectral predictability score evaluates the strength and regularity of frequency components in the time series, whereas the Lyapunov exponents quantify the chaos and stability of the system generating the data. We evaluated the effectiveness of these metrics on both synthetic and real-world time series from the M5 forecast competition dataset. Our results demonstrate that these two metrics can correctly reflect the inherent forecastability of a time series and have a strong correlation with the actual forecast performance of various models. By understanding the inherent forecastability of time series before model training, practitioners can focus their planning efforts on products and supply chain levels that are more forecastable, while setting appropriate expectations or seeking alternative strategies for products with limited forecastability.
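One common way to score the strength and regularity of frequency components is one minus the normalized spectral entropy of the periodogram. The sketch below uses that definition as an assumption; the abstract does not give the paper's exact formula, and the function name `spectral_predictability` is hypothetical.

```python
import numpy as np

def spectral_predictability(x):
    """Assumed score: 1 - normalized spectral entropy (near 1 = regular, near 0 = noise-like)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the mean level
    power = np.abs(np.fft.rfft(x)) ** 2    # periodogram (unnormalized)
    power = power[1:]                      # drop the zero-frequency bin
    total = power.sum()
    if total == 0.0:
        return 1.0                         # constant series: treated as fully predictable
    p = power / total                      # spectrum as a probability distribution
    p = p[p > 0]                           # avoid log(0)
    entropy = -(p * np.log(p)).sum()
    return 1.0 - entropy / np.log(len(power))

rng = np.random.default_rng(0)
t = np.arange(512)
score_sine = spectral_predictability(np.sin(2 * np.pi * t / 32))   # single dominant frequency
score_noise = spectral_predictability(rng.normal(size=512))        # flat spectrum on average
```

Under this definition a pure sinusoid scores close to 1, while white noise scores close to 0, matching the intuition that concentrated spectra are easier to forecast.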
Forecast reconciliation adjusts independently generated forecasts so that they satisfy some known constraints. While probabilistic forecast reconciliation is well established for linear constraints, some practical forecasting problems involve nonlinear relationships among variables. In this paper, we address probabilistic forecast reconciliation with nonlinear constraints for the first time. We extend both reconciliation via projection and conditioning to the case of nonlinear constraints. The projection approach reconciles forecast samples by mapping them onto the nonlinear coherent manifold. The conditioning approach adopts a sampling algorithm inspired by the Unscented Kalman Filter (UKF). We evaluate both methods on synthetic and real datasets. Empirically, both reconciliation approaches generally improve forecast accuracy. The UKF-based approach achieves the best overall performance while being substantially faster than the projection approach.
Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses. We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong "reasoning" capabilities. As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions. (1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts? (2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like? (3) How does performance vary across model sizes and reasoning capabilities, measured across state-of-the-art LLMs? We present three experiments covering both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors. The best-performing model we evaluated achieves an F1 score of 0.88, somewhat below human-level performanc
Electricity price forecasting is a critical tool for the efficient operation of power systems and for supporting informed decision-making by market participants. This paper explores a novel methodology aimed at improving the accuracy of electricity price forecasts by incorporating probabilistic inputs of fundamental variables. Traditional approaches often rely on point forecasts of exogenous variables such as load, solar, and wind generation. Our method proposes the integration of quantile forecasts of these fundamental variables, providing a new set of exogenous variables that account for a more comprehensive representation of uncertainty. We conducted empirical tests on the German electricity market using recent data to evaluate the effectiveness of this approach. The findings indicate that incorporating probabilistic forecasts of load and renewable energy source generation significantly improves the accuracy of point forecasts of electricity prices. Furthermore, the results clearly show that the highest improvement in forecast accuracy can be achieved with full probabilistic forecast information. This highlights the importance of probabilistic forecasting in research and practic
Recent advances in machine learning have produced probabilistic weather forecasting models comparable to state-of-the-art numerical weather predictors. However, no model consistently dominates across space and time, and relative performance is highly context-dependent. This motivates adaptive methods for combining multiple forecasts to obtain improvements and robustness. While combined forecasts have been proposed in the literature, these are achieved either through supervised learning or through prediction with expert advice methods. We introduce AdaWeather, an adaptive framework that combines many probabilistic forecasts using both machine learning and a mixture of experts to arrive at a unified improved probabilistic forecast. While traditional expert methods derive regret bounds with respect to the best single expert in hindsight, we extend the algorithm and analysis to show our method has logarithmic regret compared to the best static mixture of experts in hindsight. Empirically, we focus on forecasting temperature, and observe improvements over existing methods.
In many organisations, accurate forecasts are essential for making informed decisions for a variety of applications from inventory management to staffing optimization. Whatever forecasting model is used, changes in the underlying process can lead to inaccurate forecasts, which will be damaging to decision-making. At the same time, models are becoming increasingly complex and identifying change through direct modelling is problematic. We present a novel framework for online monitoring of forecasts to ensure they remain accurate. By utilizing sequential changepoint techniques on the forecast errors, our framework allows for the real-time identification of potential changes in the process caused by various external factors. We show theoretically that some common changes in the underlying process will manifest in the forecast errors and can be identified faster by identifying shifts in the forecast errors than within the original modelling framework. Moreover, we demonstrate the effectiveness of this framework on numerous forecasting approaches through simulations and show its effectiveness over alternative approaches. Finally, we present two concrete examples, one from Royal Mail parc
Battery energy storage systems (BESS) participating in multi-market electricity trading require price forecasts to optimize dispatch decisions. A widely held assumption is that forecast accuracy, measured by standard metrics such as mean absolute error (MAE), drives trading performance. We challenge this assumption using a hierarchical three-layer optimization system trading simultaneously on frequency containment reserve (FCR), automatic frequency restoration reserve (aFRR), day-ahead, and continuous intraday (XBID) markets in Germany and Switzerland over 2020-2025, with real market data from Regelleistung.net and Swissgrid. We find that rank correlation (Kendall tau), rather than MAE, is the primary predictor of intraday dispatch value: forecasts above an empirical threshold of tau approximately 0.85-0.95 capture up to 97-100% of perfect-foresight revenue, while persistence forecasts with near-zero tau capture only 33%. This threshold is stable across market regimes and volatility levels, and reflects the ordinal structure of the dispatch problem. Furthermore, under reserve market constraints, FCR capacity revenue exceeds XBID by 6.5x per MW, making capacity allocation -- not for
Short-term load forecasting is a critical element of energy management in power systems. In recent years, probabilistic load forecasting (PLF) has gained increased attention for its ability to provide uncertainty information that helps improve the reliability and economics of system operation. This paper proposes a two-stage probabilistic load forecasting framework that integrates the point forecast as a key feature of PLF. In the first stage, all related features are utilized to train a point forecast model and also obtain the feature importance. In the second stage, the forecasting model is trained on the point forecast feature together with selected feature subsets. During testing, the final probabilistic load forecasts are used to obtain both point and probabilistic forecasts. Numerical results obtained from ISO New England demand data demonstrate the effectiveness of the proposed approach in hour-ahead load forecasting, which uses gradient boosting regression for the point forecasting and quantile regression neural networks for the probabilistic fore
A novel forecast linear augmented projection (FLAP) method is introduced, which reduces the forecast error variance of any unbiased multivariate forecast without introducing bias. The method first constructs new component series which are linear combinations of the original series. Forecasts are then generated for both the original and component series. Finally, the full vector of forecasts is projected onto a linear subspace where the constraints implied by the combination weights hold. It is proven that the trace of the forecast error variance is non-increasing with the number of components, and mild conditions are established for which it is strictly decreasing. It is also shown that the proposed method achieves maximum forecast error variance reduction among linear projection methods. The theoretical results are validated through simulations and two empirical applications based on Australian tourism and FRED-MD data. Notably, using FLAP with Principal Component Analysis (PCA) to construct the new series leads to substantial forecast error variance reduction.
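The three steps described above (build components, forecast everything, project) can be sketched as follows. This is a minimal illustration assuming an identity weighting matrix `W` by default and made-up forecasts; the names `flap_project`, `Phi`, `y_hat`, and `c_hat` are hypothetical, and in practice `W` would be an estimate of the forecast error covariance.

```python
import numpy as np

def flap_project(y_hat, c_hat, Phi, W=None):
    """Project stacked forecasts onto the subspace where components = Phi @ originals.

    y_hat: forecasts of the m original series
    c_hat: forecasts of the p component series
    Phi:   (p, m) combination weights used to build the components
    W:     (m+p, m+p) weighting matrix (assumed identity if not given)
    """
    m, p = y_hat.shape[0], c_hat.shape[0]
    z_hat = np.concatenate([y_hat, c_hat])
    # The constraint C z = 0 encodes c = Phi y.
    C = np.hstack([-Phi, np.eye(p)])
    if W is None:
        W = np.eye(m + p)
    # z_tilde = (I - W C' (C W C')^{-1} C) z_hat
    adjustment = W @ C.T @ np.linalg.solve(C @ W @ C.T, C @ z_hat)
    z_tilde = z_hat - adjustment
    return z_tilde[:m], z_tilde[m:]

# Illustrative inputs (assumptions): three series, two linear components.
Phi = np.array([[0.5, 0.5, 0.0],
                [0.0, 1.0, -1.0]])
y_hat = np.array([10.0, 12.0, 8.0])
c_hat = np.array([11.5, 3.0])   # incoherent: Phi @ y_hat = [11.0, 4.0]
y_tilde, c_tilde = flap_project(y_hat, c_hat, Phi)
```

The first `m` entries of the projected vector are the adjusted forecasts of the original series, and the projected components satisfy the combination constraints exactly.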
Distributed, small-scale solar photovoltaic (PV) systems are being installed at a rapidly increasing rate. This can cause major impacts on distribution networks and energy markets. As a result, there is a significant need for improved forecasting of the power generation of these systems at different time resolutions and horizons. However, the performance of forecasting models depends on the resolution and horizon. Forecast combinations (ensembles), which combine the forecasts of multiple models into a single forecast, may be robust in such cases. Therefore, in this paper, we provide comparisons and insights into the performance of five state-of-the-art forecast models and existing forecast combinations at multiple resolutions and horizons. We propose a forecast combination approach based on particle swarm optimization (PSO) that will enable a forecaster to produce accurate forecasts for the task at hand by weighting the forecasts produced by individual models. Furthermore, we compare the performance of the proposed combination approach with existing forecast combination approaches. A comprehensive evaluation is conducted using a real-world residential PV power data set measured at 25
This paper discusses three key themes in forecasting for monetary policy highlighted in the Bernanke (2024) review: the challenges in economic forecasting, the conditional nature of central bank forecasts, and the importance of forecast evaluation. In addition, a formal evaluation of the Bank of England's inflation forecasts indicates that, despite the large forecast errors in recent years, they were still accurate relative to common benchmarks.
Retail sales and price projections are typically based on time series forecasting. For some product categories, the accuracy of demand forecasts achieved is low, negatively impacting inventory, transport, and replenishment planning. This paper presents our findings based on a proactive pilot exercise to explore ways to help retailers improve forecast accuracy for such product categories. We evaluated opportunities for algorithmic interventions to improve forecast accuracy based on a sample product category, Knitwear. The Knitwear product category has a current demand forecast accuracy from non-AI models in the range of 60%. We explored how to improve the forecast accuracy using a rack approach. To generate forecasts, our decision model dynamically selects the best algorithm from an algorithm rack based on performance for a given state and context. Outcomes from our AI/ML forecasting model built using advanced feature engineering show an increase in the accuracy of the demand forecast for the Knitwear product category by 20%, taking the overall accuracy to 80%. Because our rack comprises algorithms that cater to a range of customer data sets, the forecasting model can be easily tailored
Modern weather forecasts are commonly issued as consistent multi-day forecast trajectories with a time resolution of 1-3 hours. Prior to issuing, statistical post-processing is routinely used to correct systematic errors and misrepresentations of the forecast uncertainty. However, once the forecast has been issued, it is rarely updated before it is replaced in the next forecast cycle of the numerical weather prediction (NWP) model. This paper shows that the error correlation structure within the forecast trajectory can be utilized to substantially improve the forecast between the NWP forecast cycles by applying additional post-processing steps each time new observations become available. The proposed rapid adjustment is applied to temperature forecast trajectories from the UK Met Office's convective-scale ensemble MOGREPS-UK. MOGREPS-UK is run four times daily and produces hourly forecasts for up to 36 hours ahead. Our results indicate that the rapidly adjusted forecast from the previous NWP forecast cycle outperforms the new forecast for the first few hours of the next cycle, or until the new forecast itself can be rapidly adjusted, suggesting a new strategy for updating the forec
In financial time series forecasting, the naive forecast is a notoriously difficult benchmark to surpass because of the stochastic nature of the data. Motivated by this challenge, this study introduces the movement prediction-adjusted naive forecast (MPANF), a forecast combination method that systematically refines the naive forecast by incorporating directional information. In particular, MPANF adjusts the naive forecast with an increment formed by three components: the in-sample mean absolute increment as the base magnitude, the movement prediction as the sign, and a coefficient derived from the in-sample movement prediction accuracy as the scaling factor. The experimental results on eight financial time series, using the RMSE, MAE, MAPE, and sMAPE, show that with a movement prediction accuracy of approximately 0.55, MPANF generally outperforms common benchmarks, including the naive forecast, naive forecast with drift, IMA(1,1), and linear regression. These findings indicate that MPANF has the potential to outperform the naive baseline when reliable movement predictions are available.
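The three-component adjustment can be illustrated with a short sketch. The abstract does not state how the scaling coefficient is derived from the movement prediction accuracy, so the form `2 * accuracy - 1` used here is purely an assumption chosen so that a coin-flip accuracy of 0.5 recovers the naive forecast; the function name `mpanf_forecast` is also hypothetical.

```python
import numpy as np

def mpanf_forecast(history, predicted_direction, accuracy):
    """Naive forecast plus a signed, scaled increment.

    history:             in-sample observations
    predicted_direction: +1 for a predicted rise, -1 for a predicted fall
    accuracy:            in-sample movement prediction accuracy in [0, 1]
    """
    history = np.asarray(history, dtype=float)
    # Base magnitude: in-sample mean absolute increment.
    base_magnitude = np.mean(np.abs(np.diff(history)))
    # ASSUMED scaling coefficient (not taken from the paper).
    scale = 2.0 * accuracy - 1.0
    # Naive forecast (last value) adjusted by sign * scale * magnitude.
    return history[-1] + predicted_direction * scale * base_magnitude

history = [100.0, 101.0, 99.0, 102.0]
forecast_up = mpanf_forecast(history, +1, 0.55)
forecast_coin_flip = mpanf_forecast(history, +1, 0.50)
```

With an accuracy of 0.55 the adjustment is small relative to the typical increment, which is consistent with the abstract's point that even modest directional skill can refine the naive forecast.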
The path toward realizing the potential of seasonal forecasting and its socioeconomic benefits depends heavily on improving general circulation model based dynamical forecasting systems. To improve dynamical seasonal forecasts, it is crucial to set up forecast benchmarks and to clarify the forecast limitations posed by model initialization errors, formulation deficiencies, and internal climate variability. Given the huge cost of generating large forecast ensembles and the limited observations available for forecast verification, the task of benchmarking and diagnosing seasonal forecasts proves challenging. In this study, we develop a probabilistic deep neural network model, drawing on a wealth of existing climate simulations to enhance seasonal forecast capability and forecast diagnosis. By leveraging complex physical relationships encoded in climate simulations, our probabilistic forecast model demonstrates favorable deterministic and probabilistic skill compared to state-of-the-art dynamical forecast systems in quasi-global seasonal forecast of precipitation and near-surface temperature. We apply this probabilistic forecast methodology to quantify the impacts of initialization errors and model formulation de
Many autonomous systems forecast aspects of the future in order to aid decision-making. For example, self-driving vehicles and robotic manipulation systems often forecast future object poses by first detecting and tracking objects. However, this detect-then-forecast pipeline is expensive to scale, as pose forecasting algorithms typically require labeled sequences of object poses, which are costly to obtain in 3D space. Can we scale performance without requiring additional labels? We hypothesize yes, and propose inverting the detect-then-forecast pipeline. Instead of detecting, tracking and then forecasting the objects, we propose to first forecast 3D sensor data (e.g., point clouds with $100$k points) and then detect/track objects on the predicted point cloud sequences to obtain future poses, i.e., a forecast-then-detect pipeline. This inversion makes it less expensive to scale pose forecasting, as the sensor data forecasting task requires no labels. Part of this work's focus is on the challenging first step -- Sequential Pointcloud Forecasting (SPF), for which we also propose an effective approach, SPFNet. To compare our forecast-then-detect pipeline relative to the detect-then-fo
Volatility forecasts are key inputs in financial analysis. While lasso-based forecasts have been shown to perform well in many applications, their use to obtain volatility forecasts has not yet received much attention in the literature. Lasso estimators produce parsimonious forecast models. Our forecast combination approach hedges against the risk of selecting a wrong degree of model parsimony. Apart from the standard lasso, we consider several lasso extensions that account for the dynamic nature of the forecast model. We apply forecast combined lasso estimators in a comprehensive forecasting exercise using realized variance time series of ten major international stock market indices. We find that the lasso extension "ordered lasso" gives the most accurate realized variance forecasts. Multivariate forecast models, accounting for volatility spillovers between different stock markets, outperform univariate forecast models for longer forecast horizons.