Data science has become increasingly essential for the production of official statistics, as it enables the automated collection, processing, and analysis of large amounts of data. With such data science practices in place, reporting can become more timely, more insightful, and more flexible. However, the quality and integrity of data-science-driven statistics rely on the accuracy and reliability of the data sources and the machine learning techniques that support them. In particular, changes in data sources are inevitable and pose significant risks that are crucial to address in the context of machine learning for official statistics. This paper gives an overview of the main risks, liabilities, and uncertainties associated with changing data sources in the context of machine learning for official statistics. We provide a checklist of the most prevalent origins and causes of changing data sources, not only on a technical level but also regarding ownership, ethics, regulation, and public perception. Next, we highlight the repercussions of changing data sources on statistical reporting. These include technical effects such as concept drift, bias, availability, validity, and accuracy.
We obtain two theorems extending the use of a saddlepoint approximation to multiparameter problems for likelihood ratio-like statistics, which allows their use in permutation and rank tests and could allow their use in bootstrap approximations. In the first, we show that in some cases when no density exists, the integral of the formal saddlepoint density over the set corresponding to large values of the likelihood ratio-like statistic approximates the true probability with relative error of order $1/n$. In the second, we give multivariate generalizations of the Lugannani--Rice and Barndorff-Nielsen or $r^*$ formulas for the approximations. These theorems are applied to obtain permutation tests based on the likelihood ratio-like statistics for the $k$-sample and the multivariate two-sample cases. Numerical examples are given to illustrate the high degree of accuracy, and these statistics are compared to the classical statistics in both cases.
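For orientation, a schematic univariate statement of the two formulas being generalized (not the paper's multivariate result) reads as follows, with $K$ the cumulant generating function, $\hat\theta$ the saddlepoint solving $K'(\hat\theta)=x$, and $\Phi$, $\phi$ the standard normal distribution and density:
\[
P(\bar X_n \ge x) \approx 1-\Phi(\hat w)+\phi(\hat w)\Big(\frac{1}{\hat u}-\frac{1}{\hat w}\Big) \approx 1-\Phi\Big(\hat w+\frac{1}{\hat w}\log\frac{\hat u}{\hat w}\Big),
\]
where $\hat w=\operatorname{sgn}(\hat\theta)\sqrt{2n\{\hat\theta x-K(\hat\theta)\}}$ and $\hat u=\hat\theta\sqrt{nK''(\hat\theta)}$; the first display is the Lugannani--Rice form and the second the Barndorff-Nielsen $r^*$ form, and the multivariate theorems replace these ingredients with likelihood ratio-like analogues.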
Two-sample $U$-statistics are widely used in a broad range of applications, including those in the fields of biostatistics and econometrics. In this paper, we establish sharp Cramér-type moderate deviation theorems for Studentized two-sample $U$-statistics in a general framework, including the two-sample $t$-statistic and Studentized Mann-Whitney test statistic as prototypical examples. In particular, a refined moderate deviation theorem with second-order accuracy is established for the two-sample $t$-statistic. These results extend the applicability of the existing statistical methodologies from the one-sample $t$-statistic to more general nonlinear statistics. Applications to two-sample large-scale multiple testing problems with false discovery rate control and the regularized bootstrap method are also discussed.
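To fix ideas, a Cramér-type moderate deviation theorem of the kind obtained here asserts, schematically, that a Studentized statistic $T_n$ satisfies
\[
\frac{P(T_n \ge x)}{1-\Phi(x)} \to 1 \quad \text{uniformly in } 0\le x \le o(n^{1/6}),
\]
under suitable moment conditions (the admissible range and error terms depend on the statistic); it is this uniform control of the relative error deep in the tails that justifies normal calibration in large-scale multiple testing with false discovery rate control.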
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels.
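The core correspondence can be stated in one line. For a semimetric $\rho$ of negative type and any fixed base point $z_0$, the induced distance kernel and the resulting identity are
\[
k(x,y)=\tfrac12\{\rho(x,z_0)+\rho(y,z_0)-\rho(x,y)\},\qquad \mathcal E_\rho(P,Q)=2\,\mathrm{MMD}_k^2(P,Q),
\]
where $\mathcal E_\rho(P,Q)=2\,\mathbb E\rho(X,Y)-\mathbb E\rho(X,X')-\mathbb E\rho(Y,Y')$ for $X,X'\sim P$ and $Y,Y'\sim Q$, all independent.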
I first met Leo Breiman in 1979 at the beginning of his third career, Professor of Statistics at Berkeley. He obtained his PhD with Loève at Berkeley in 1957. His first career was as a probabilist in the Mathematics Department at UCLA. After distinguished research, including the Shannon--McMillan--Breiman Theorem, and getting tenure, he decided that his real interest was in applied statistics, so he resigned his position at UCLA and set up as a consultant. Before doing so he produced two classic texts, Probability, now reprinted as a SIAM Classic in Applied Mathematics, and Statistics. Both books reflected his strong opinion that intuition and rigor must be combined. He expressed this in his probability book, which he viewed as a combination of his learning the right hand of probability, rigor, from Loève, and the left hand, intuition, from David Blackwell.
With the possible exception of gambling, meteorology, particularly precipitation forecasting, may be the area through which the general public is most familiar with probabilistic assessments of uncertainty. Despite the heavy use of stochastic models and statistical methods in weather forecasting and other areas of the atmospheric sciences, papers in these areas have traditionally been somewhat uncommon in statistics journals. We see signs of this changing in recent years, and we have sought to highlight some present research directions at the interface of statistics and the atmospheric sciences in this special section.
The development of wavelet theory has in recent years spawned applications in signal processing, in fast algorithms for integral transforms, and in image and function representation methods. This last application has stimulated interest in wavelet applications to statistics and to the analysis of experimental data, with many successes in the efficient analysis, processing, and compression of noisy signals and images. This is a selective review article that attempts to synthesize some recent work on ``nonlinear'' wavelet methods in nonparametric curve estimation and their role in a variety of applications. After a short introduction to wavelet theory, we discuss in detail several wavelet shrinkage and wavelet thresholding estimators, scattered in the literature and developed, under more or less standard settings, for density estimation from i.i.d. observations or to denoise data modeled as observations of a signal with additive noise. Most of these methods are fitted into the general concept of regularization with appropriately chosen penalty functions. A narrow range of applications in major areas of statistics is also discussed, such as partial linear regression models.
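As a concrete instance of the shrinkage rules reviewed, the hard- and soft-thresholding estimators of an empirical wavelet coefficient $d_{jk}$ are
\[
\hat d_{jk}^{\mathrm H}=d_{jk}\,\mathbf 1\{|d_{jk}|>\lambda\},\qquad \hat d_{jk}^{\mathrm S}=\operatorname{sgn}(d_{jk})\,(|d_{jk}|-\lambda)_+,
\]
with, for example, the universal threshold $\lambda=\hat\sigma\sqrt{2\log n}$ of Donoho and Johnstone; in the regularization view, these rules arise from $\ell_0$- and $\ell_1$-type penalties, respectively.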
We consider the detection of multivariate spatial clusters in the Bernoulli model with $N$ locations, where the design distribution has weakly dependent marginals. The locations are scanned with a rectangular window with sides parallel to the axes and with varying sizes and aspect ratios. Multivariate scan statistics pose a statistical problem due to the multiple testing over many scan windows, as well as a computational problem because statistics have to be evaluated on many windows. This paper introduces methodology that leads to both statistically optimal inference and computationally efficient algorithms. The main difference from the traditional calibration of scan statistics is the concept of grouping scan windows according to their sizes, and then applying different critical values to different groups. It is shown that this calibration of the scan statistic results in optimal inference for spatial clusters on both small scales and large scales, as well as in the case where the cluster lives on one of the marginals. Methodology is introduced that allows for an efficient approximation of the set of all rectangles while still guaranteeing the statistical optimality results described above.
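Schematically (this is a sketch of the idea, not the paper's exact calibration), the window collection $\mathcal W$ is partitioned into size groups $\mathcal G_1,\dots,\mathcal G_m$, and one rejects when
\[
\max_{W\in\mathcal G_j} T(W) > \kappa_j \quad\text{for some } j\in\{1,\dots,m\},
\]
with group-specific critical values $\kappa_j$ chosen so that the significance levels allocated to the groups sum to the overall level $\alpha$; the many small windows then receive larger critical values than the few large ones, which is what makes optimal inference possible simultaneously across scales.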
Statistical depth measures the centrality of a point with respect to a given distribution or data cloud. It provides a natural center-outward ordering of multivariate data points and yields a systematic nonparametric multivariate analysis scheme. In particular, the half-space depth is shown to have many desirable properties and broad applicability. However, the empirical half-space depth is zero outside the convex hull of the data. This property has rendered the empirical half-space depth useless outside the data cloud and limited its utility in applications where the extreme outlying probability mass is the focal point, such as in classification problems and control charts with very small false alarm rates. To address this issue, we apply extreme value statistics to refine the empirical half-space depth in "the tail." This provides an important linkage between data depth, which is useful for inference on centrality, and extreme value statistics, which is useful for inference on extremity. The refined empirical half-space depth can thus extend all its utilities beyond the data cloud, and hence greatly broaden its applicability.
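Recall that the half-space (Tukey) depth of a point $x\in\mathbb R^d$ with respect to a probability measure $P$ is
\[
\mathrm{HD}(x;P)=\inf_{\|u\|=1} P\{u^\top X\ge u^\top x\},
\]
so the empirical version vanishes whenever some closed half-space with $x$ on its boundary contains no observations, which happens everywhere outside the convex hull of the data; roughly speaking, the refinement replaces the empirical tail probability in this infimum with an extreme-value-based estimate.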
We investigate the behavior of Fourier transforms for a wide class of nonstationary nonlinear processes. Asymptotic central and noncentral limit theorems are established for a class of nondegenerate and degenerate weighted $V$-statistics through the angle of Fourier analysis. The established theory for $V$-statistics provides a unified treatment for many important time and spectral domain problems in the analysis of nonstationary time series, ranging from nonparametric estimation to the inference of periodograms and spectral densities.
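For reference, a weighted $V$-statistic of the type studied takes the schematic form
\[
V_n=\sum_{i=1}^n\sum_{j=1}^n a_n(i,j)\,K(X_i,X_j),
\]
for weights $a_n(i,j)$ and a kernel $K$; the statistic is called degenerate when the one-variable projections of the kernel vanish and nondegenerate otherwise, the two cases leading to the noncentral and central limit theorems, respectively.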
In my 2011 Annals of Applied Statistics article [Goerg (2011)] I wrote that "Whereas the Lambert $W$ function plays an important role in mathematics, physics, chemistry, biology and other fields, it has not yet been used in statistics." This was incorrect. At the time of publication I was unaware of Stehlík (2003), who used the Lambert $W$ function to derive the exact distribution of the likelihood ratio test statistic. He has also used it in more recent work such as Stehlík (2006), Stehlík et al. (2010), or Stehlík (2014) amongst others. While Stehlík's use of the Lambert $W$ function was focused on the distribution of the likelihood ratio test statistic, my work dealt with the modeling of skewed random variables and symmetrizing data using the Lambert $W$ function as a variable transformation.
Our data are random fields of multivariate Gaussian observations, and we fit a multivariate linear model with common design matrix at each point. We are interested in detecting those points where some of the coefficients are nonzero using classical multivariate statistics evaluated at each point. The problem is to find the $P$-value of the maximum of such a random field of test statistics. We approximate this by the expected Euler characteristic of the excursion set. Our main result is a very simple method for calculating this, which not only gives us the previous result of Cao and Worsley [Ann. Statist. 27 (1999) 925--942] for Hotelling's $T^2$, but also random fields of Roy's maximum root, maximum canonical correlations [Ann. Appl. Probab. 9 (1999) 1021--1057], multilinear forms [Ann. Statist. 29 (2001) 328--371], $\bar{\chi}^2$ [Statist. Probab. Lett. 32 (1997) 367--376, Ann. Statist. 25 (1997) 2368--2387] and $\chi^2$ scale space [Adv. in Appl. Probab. 33 (2001) 773--793]. The trick involves approaching the problem from the point of view of Roy's union-intersection principle. The results are applied to a problem in shape analysis where we look for brain damage due to nonmissile trauma.
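The underlying approximation, accurate for high thresholds $u$, is
\[
P\Big(\max_{t\in S} T(t)\ge u\Big)\approx \mathbb E\,\chi\big(\{t\in S: T(t)\ge u\}\big),
\]
where $\chi$ denotes the Euler characteristic of the excursion set: for large $u$ that set is, with high probability, either empty or a single connected component, so its Euler characteristic essentially indicates whether the maximum exceeds $u$.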
Spectral sampling is associated with the group of unitary transformations acting on matrices in much the same way that simple random sampling is associated with the symmetric group acting on vectors. This parallel extends to symmetric functions, $k$-statistics and polykays. We construct spectral $k$-statistics as unbiased estimators of cumulants of trace powers of a suitable random matrix. Moreover, we define normalized spectral polykays in such a way that when the sampling is from an infinite population they return products of free cumulants.
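For orientation, the classical $k$-statistics are the symmetric unbiased estimators of cumulants, $\mathbb E[k_r]=\kappa_r$; for instance,
\[
k_1=\bar X,\qquad k_2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2,
\]
and the polykays extend this to unbiased estimation of products of cumulants. The spectral $k$-statistics constructed here play the analogous role for cumulants of trace powers under unitary averaging.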
The classical theory of rank-based inference is entirely based either on ordinary ranks, which do not allow for considering location (intercept) parameters, or on signed ranks, which require an assumption of symmetry. If the median, in the absence of a symmetry assumption, is considered as a location parameter, the maximal invariance property of ordinary ranks is lost to the ranks and the signs. This new maximal invariant thus suggests a new class of statistics, based on ordinary ranks and signs. An asymptotic representation theory à la Hájek is developed here for such statistics, both in the nonserial and in the serial case. The corresponding asymptotic normality results clearly show how the signs add a separate contribution to the asymptotic variance, hence, potentially, to asymptotic efficiency. As shown by Hallin and Werker [Bernoulli 9 (2003) 137--165], conditioning in an appropriate way on the maximal invariant potentially even leads to semiparametrically efficient inference. Applications to semiparametric inference in regression and time series models with median restrictions are treated in detail in an upcoming companion paper.
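In the simplest i.i.d. setting, the invariance at play can be sketched as follows (a reading of the setup, not the paper's exact statement): under the group of continuous, strictly increasing transformations $g$ with $g(0)=0$, which preserve the median restriction, a maximal invariant is
\[
\big(\operatorname{sign}(X_1),\dots,\operatorname{sign}(X_n);\,R_1,\dots,R_n\big),
\]
the signs together with the ordinary ranks $R_i$, which is what motivates the class of statistics based on ranks and signs.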
Generalized likelihood ratio (GLR) test statistics are often used in the detection of spatial clustering in case-control and case-population datasets to check for a significantly large proportion of cases within some scanning window. The traditional spatial scan test statistic takes the supremum GLR value over all windows, whereas the average likelihood ratio (ALR) test statistic that we consider here takes an average of the GLR values. Numerical experiments in the literature and in this paper show that the ALR test statistic has more power than the spatial scan statistic. We develop in this paper accurate tail probability approximations of the ALR test statistic that allow us to bypass computer-intensive Monte Carlo procedures to estimate $p$-values. In models that adjust for covariates, these Monte Carlo evaluations require an initial fitting of parameters that can result in very biased $p$-value estimates.
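In schematic form, writing $\mathrm{GLR}(W)$ for the generalized likelihood ratio of scanning window $W$ over a window collection $\mathcal W$, the two statistics being compared are
\[
T_{\mathrm{scan}}=\sup_{W\in\mathcal W}\mathrm{GLR}(W),\qquad T_{\mathrm{ALR}}=\sum_{W\in\mathcal W} w(W)\,\mathrm{GLR}(W),
\]
for suitable weights $w(\cdot)$; the tail approximations developed here target the upper tail of $T_{\mathrm{ALR}}$ directly, avoiding Monte Carlo calibration.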
Let $\mathbf{Q}=(Q_1,\ldots,Q_n)$ be a random vector drawn from the uniform distribution on the set of all $n!$ permutations of $\{1,2,\ldots,n\}$. Let $\mathbf{Z}=(Z_1,\ldots,Z_n)$, where $Z_j$ is the mean zero, variance one random variable obtained by centering and normalizing $Q_j$, $j=1,\ldots,n$. Assume that $\mathbf{X}_i, i=1,\ldots,p$ are i.i.d. copies of $\frac{1}{\sqrt{p}}\mathbf{Z}$ and $X=X_{p,n}$ is the $p\times n$ random matrix with $\mathbf{X}_i$ as its $i$th row. Then $S_n=XX^*$ is called the $p\times n$ Spearman's rank correlation matrix, which can be regarded as a high-dimensional extension of the classical nonparametric statistic Spearman's rank correlation coefficient between two independent random variables. In this paper, we establish a CLT for the linear spectral statistics of this nonparametric random matrix model in the scenario of high dimension, namely, $p=p(n)$ and $p/n\to c\in(0,\infty)$ as $n\to\infty$. We propose a novel evaluation scheme to estimate the core quantity in Anderson and Zeitouni's cumulant method in [Ann. Statist. 36 (2008) 2553--2576] to bypass the so-called joint cumulant summability. In addition, we introduce a two-step comparison approach.
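Concretely, for a test function $f$ and the eigenvalues $\lambda_1,\dots,\lambda_p$ of $S_n$, the linear spectral statistic in question is
\[
\mathcal L_n(f)=\sum_{i=1}^p f(\lambda_i),
\]
and the CLT describes the Gaussian fluctuations of $\mathcal L_n(f)$ around a deterministic centering given by $p$ times the integral of $f$ against the limiting spectral distribution, here of Marchenko--Pastur type with ratio index $c$.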
In this paper, we derive valid Edgeworth expansions for studentized versions of a large class of statistics when the data are generated by a strongly mixing process. Under dependence, the asymptotic variance of such a statistic is given by an infinite series of lag-covariances, and therefore, studentizing factors (i.e., estimators of the asymptotic standard error) typically involve an increasing number, say $\ell$, of lag-covariance estimators, which are themselves quadratic functions of the observations. The unboundedness of the dimension $\ell$ of these quadratic functions makes the derivation and the form of the expansions nonstandard. It is shown that, in contrast to the case of studentized means under independence, the derived Edgeworth expansion is a superposition of three distinct series, respectively given by one in powers of $n^{-1/2}$, one in powers of $[n/\ell]^{-1/2}$ (resulting from the standard error of the studentizing factor) and one in powers of the bias of the studentizing factor, where $n$ denotes the sample size.
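Schematically (a sketch of the structure just described, not the precise expansion), the result for a studentized statistic $T_n$ superposes the three series:
\[
P(T_n\le x)=\Phi(x)+\phi(x)\Big\{\sum_{j\ge1} n^{-j/2}\,p_j(x)+\sum_{j\ge1}(\ell/n)^{j/2}\,q_j(x)+\sum_{j\ge1} b_n^{\,j}\,r_j(x)\Big\}+\text{remainder},
\]
where $b_n$ denotes the bias of the studentizing factor and $p_j,q_j,r_j$ are polynomials; for studentized means under independence, only the first series appears.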
A general rate estimation method is proposed that is based on studying the in-sample evolution of appropriately chosen diverging/converging statistics. The proposed rate estimators are based on simple least squares arguments, and are shown to be accurate in a very general setting without requiring the choice of a tuning parameter. The notion of scanning is introduced with the purpose of extracting useful subsamples of the data series; the proposed rate estimation method is applied to different scans, and the resulting estimators are then combined to improve accuracy. Applications to heavy tail index estimation as well as to the problem of estimating the long memory parameter are discussed; a small simulation study complements our theoretical results.
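The least squares idea can be made explicit: if a statistic computed from the first $k$ observations behaves like $T_k\approx Ck^{\beta}$, then over subsample sizes $k_1<\cdots<k_m$ extracted by a scan, the rate exponent is estimated by the slope of a log--log regression,
\[
(\hat a,\hat\beta)=\operatorname*{arg\,min}_{a,\beta}\sum_{i=1}^m\big(\log|T_{k_i}|-a-\beta\log k_i\big)^2,
\]
and the estimators obtained from different scans are then combined (e.g., averaged) to improve accuracy.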
The purpose of this paper is to investigate and develop methods for the analysis of multi-center randomized clinical trials which rely only on the randomization process as a basis of inference. Our motivation is prompted by the fact that most current statistical procedures used in the analysis of randomized multi-center studies are model-based. The randomization feature of the trials is usually ignored. An important characteristic of model-based analysis is that it is straightforward to model covariates. Nevertheless, in nearly all model-based analyses, the effects due to different centers and, in general, the design of the clinical trials are ignored. An alternative to a model-based analysis is to have analyses guided by the design of the trial. Our development of design-based methods allows the incorporation of centers as well as other features of the trial design. The methods make use of conditioning on the ancillary statistics in the sample space generated by the randomization process. We have investigated the power of the methods and have found that, in the presence of center variation, there is a significant increase in power. The methods have been extended to group sequential trials.
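A design-based $p$-value of the kind described is computed by conditioning on the realized design: if $\Pi$ denotes the set of treatment assignments attainable under the trial's center-stratified randomization scheme and $T$ is the chosen test statistic, then, schematically,
\[
p=\frac{1}{|\Pi|}\sum_{\pi\in\Pi}\mathbf 1\{T(\pi)\ge T_{\mathrm{obs}}\},
\]
so centers and other design features enter the inference through the reference set $\Pi$ rather than through a model.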
What makes a problem suitable for statistical analysis? Are historical and religious questions addressable using statistical calculations? Such issues have long been debated in the statistical community, and statisticians and others have used historical information and texts to analyze such questions as the economics of slavery, the authorship of the Federalist Papers and the question of the existence of God. But what about historical and religious attributions associated with information gathered from archaeological finds? In 1980, a construction crew working in the Jerusalem neighborhood of East Talpiot stumbled upon a crypt. Archaeologists from the Israel Antiquities Authority came to the scene and found 10 limestone burial boxes, known as ossuaries, in the crypt. Six of these had inscriptions. The remains found in the ossuaries were reburied, as required by Jewish religious tradition, and the ossuaries were catalogued and stored in a warehouse. The inscriptions on the ossuaries were catalogued and published by Rahmani (1994) and by Kloner (1996), but their reports did not receive widespread public attention. Fast forward to March 2007, when a television ``docudrama'' aired on the Discovery Channel.