Flaky tests make automated software testing unreliable because of their unpredictable behavior: they can pass or fail on the same code base across multiple runs. Flaky test failures often do not point to any fault, yet they can cause the continuous integration (CI) pipeline to fail. A common type of flaky test is the order-dependent (OD) test. The outcome of an OD test depends on the order in which it is run with respect to other test cases. Several studies have explored the detection and repair of OD tests. However, their methods require many re-runs of tests, including tests that are not related to order dependence. Hence, prioritizing potential OD tests is necessary to reduce the re-runs. In this paper, we propose a method to prioritize potential order-dependent tests. By analyzing shared static fields in test classes, we identify tests that are more likely to be order-dependent. In our experiment on 27 project modules, our method successfully prioritized all OD tests in 23 cases, reducing test executions by an average of 65.92% and unnecessary re-runs by 72.19%. These results demonstrate that our approach significantly improves the efficiency of OD test detection by l
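The idea of ranking tests by the static fields they share can be illustrated with a minimal sketch. It assumes a hypothetical input, `field_access`, that maps each test name to the set of static fields it touches. The scoring rule (count of fields also touched by at least one other test) is an illustrative assumption, not the scoring used in the paper.

```python
from collections import defaultdict


def prioritize_od_candidates(field_access):
    """Rank tests by how many static fields they share with other tests.

    field_access: dict mapping test name -> set of static field names
    (hypothetical input format, assumed for this sketch).
    """
    # Invert the mapping: which tests touch each static field?
    users = defaultdict(set)
    for test, fields in field_access.items():
        for field in fields:
            users[field].add(test)

    # Score a test by the number of its fields that another test also touches.
    scores = {
        test: sum(1 for f in fields if len(users[f]) > 1)
        for test, fields in field_access.items()
    }

    # Highest score first; ties are broken by name for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))


example = {
    "testA": {"Cache.INSTANCE", "Config.flags"},
    "testB": {"Cache.INSTANCE"},
    "testC": {"Logger.level"},
    "testD": {"Config.flags", "Cache.INSTANCE"},
}
ranking = prioritize_od_candidates(example)
```

Under this assumed rule, a test such as `testC` that shares no static field with any other test lands at the end of the ranking, so an OD detector would re-run it last or not at all.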
Flaky tests yield different results when executed multiple times for the same version of the source code. Thus, they provide an ambiguous signal about the quality of the code and interfere with the automated assessment of code changes. While a variety of factors can cause test flakiness, approaches to fix flaky tests are typically tailored to address specific causes. However, the prevalent root causes of flaky tests can vary depending on the programming language, application domain, or size of the software project. Since manually labeling flaky tests is time-consuming and tedious, this work proposes an LLMs-as-annotators approach that leverages intra- and inter-model consistency to label issue reports related to fixed flakiness issues with the relevant root cause category. This allows us to gain an overview of prevalent flakiness categories in the issue reports. We evaluated our labeling approach in the context of SAP HANA, a large industrial database management system. Our results suggest that SAP HANA's tests most commonly suffer from issues related to concurrency (23%, 130 of 559 analyzed issue reports). Moreover, our results suggest that different test types face different flak
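One way to read "intra- and inter-model consistency" is sketched below. It assumes each issue report is labeled several times by each of several models, and that a label is accepted only when every run of every model agrees. The function name, the input format, and the strict-agreement rule are all assumptions made for illustration; the source does not specify the exact criterion.

```python
def consistent_label(runs_by_model):
    """Return the agreed root-cause label, or None if any check fails.

    runs_by_model: dict mapping model name -> list of labels from repeated
    runs on the same issue report (hypothetical format for this sketch).
    """
    per_model = []
    for labels in runs_by_model.values():
        # Intra-model consistency: all runs of one model must agree.
        if len(set(labels)) != 1:
            return None
        per_model.append(labels[0])

    # Inter-model consistency: all models must agree with each other.
    if len(set(per_model)) == 1:
        return per_model[0]
    return None


accepted = consistent_label({
    "model_x": ["concurrency", "concurrency", "concurrency"],
    "model_y": ["concurrency", "concurrency"],
})
rejected = consistent_label({
    "model_x": ["concurrency", "timeout"],
    "model_y": ["concurrency", "concurrency"],
})
```

Reports for which the function returns `None` would be left unlabeled or passed to a human annotator under this reading.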
Testing for a mediation effect is important in many disciplines, but is made difficult - even asymptotically - by the influence of nuisance parameters. Classical tests such as likelihood ratio (LR) and Wald (Sobel) tests have very poor power properties in parts of the parameter space, and many attempts have been made to produce improved tests, with limited success. In this paper we show that augmenting the critical region of the LR test can produce a test with much improved behavior everywhere. In fact, we first show that there exists a test of this type that is (asymptotically) exact for certain test levels $α$, including the common choices $α=.01,.05,.10.$ The critical region of this exact test has some undesirable properties. We go on to show that there is a very simple class of augmented LR critical regions which provides tests that are nearly exact, and avoid the issues inherent in the exact test. We suggest an optimal and coherent member of this class, provide the table needed to implement the test and to report p-values if desired. Simulation confirms validity with non-Gaussian disturbances, under heteroskedasticity, and in a nonlinear (logit) model. A short application of t
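To make the classical tests concrete, the sketch below computes the Sobel (Wald) statistic for the indirect effect $ab$ and a min-|t| rejection rule. Treating the LR critical region as "both t-ratios exceed the normal critical value" is an assumption here, since the text above does not spell out the statistic; the augmented regions proposed in the paper are not reproduced. The names `a`, `b`, `se_a`, `se_b` stand for the two path estimates and their standard errors.

```python
import math
from statistics import NormalDist


def sobel_z(a, se_a, b, se_b):
    """Sobel (first-order delta method) z-statistic for the product a*b."""
    return (a * b) / math.sqrt(a * a * se_b * se_b + b * b * se_a * se_a)


def min_t_rejects(a, se_a, b, se_b, alpha=0.05):
    """Reject 'no mediation' when both |t|-ratios exceed the two-sided
    normal critical value (assumed form of the LR-type region)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return min(abs(a / se_a), abs(b / se_b)) > z_crit


# Both paths have t-ratio 2.0, just above the 5% critical value.
z = sobel_z(0.2, 0.1, 0.2, 0.1)
sobel_rejects = abs(z) > NormalDist().inv_cdf(0.975)
joint_rejects = min_t_rejects(0.2, 0.1, 0.2, 0.1)
```

In this example the Sobel statistic is about 1.41 and does not reject at the 5% level, while the min-|t| rule does, which illustrates the kind of power difference between classical tests that motivates improved critical regions.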
Building on Ésik and Kuich's completeness result for finitely weighted Kleene algebra, we establish relational and language completeness results for finitely weighted Kleene algebra with tests. Like Ésik and Kuich, we assume that the finite semiring of weights is commutative, partially ordered and zero-bounded, but we also assume that it is integral. We argue that finitely weighted Kleene algebra with tests is a natural framework for equational reasoning about weighted programs in cases where an upper bound on admissible weights is assumed.
Tests of gravity are important to the development of our understanding of gravitation and spacetime. Binary pulsars provide a superb playground for testing gravity theories. In this chapter we pedagogically review the basics behind pulsar observations and pulsar timing. We illustrate various recent strong-field tests of general relativity (GR) from the Hulse-Taylor pulsar PSR B1913+16, the double pulsar PSR J0737$-$3039, and the triple pulsar PSR J0337+1715. We also give an overview of the inner structure of neutron stars (NSs), which may influence some gravity tests, and use scalar-tensor gravity and massive gravity theories as examples to demonstrate the usefulness of pulsar timing in constraining specific modified gravity theories. The outlook for new radio telescopes for pulsar timing and synergies with other strong-field gravity tests are also presented.
Following the 2019 release by the Event Horizon Telescope Collaboration of the first pictures of a supermassive black hole, there has been an explosion of interest in black hole images, their theoretical interpretation, and their potential use in tests of general relativity. The literature on the subject has now become so vast that an introductory guide for newcomers would appear welcome. Here, we aim to provide an accessible entry point to this growing field, with a particular focus on the black hole "photon ring": the bright, narrow ring of light that dominates images of a black hole and belongs to the black hole itself, rather than to its surrounding plasma. Far from an exhaustive review, this beginner's guide offers a pedagogical review of the key basic concepts and a brief summary of some results at the research frontier.
We review experimental tests of three-flavor (u,d,s) chiral perturbation theory (ChPT). These include measurements of pion and kaon polarizabilities, of the chiral anomaly amplitudes for processes such as $γ\rightarrow πππ$, $γ\rightarrow ππη$, and $γ\rightarrow KKπ$, and of the lifetimes of the neutral pion and eta. These observables are extracted primarily through ultra-peripheral Primakoff scattering of high-energy particles from virtual photons in the Coulomb field of nuclei. Comparing data to two-flavor and three-flavor predictions allows us to evaluate how well ChPT describes light meson dynamics and the role of the strange quark in spontaneous chiral symmetry breaking.
We investigate the large scale anomalies in the angular distribution of the cosmic microwave background radiation as measured by WMAP using several tests. These tests, based on the multipole vector expansion, measure correlations between the phases of the multipoles as expressed by the directions of the multipole vectors and their associated normal planes. We have computed the probability distribution functions for 46 such tests, for the multipoles l=2-5. We confirm earlier findings that point to a high level of alignment between l=2 (quadrupole) and l=3 (octopole), but with our tests we do not find significant planarity in the octopole. In addition, we have found other possible anomalies in the alignment between the octopole and the l=4 (hexadecupole) components, as well as in the planarity of l=4 and l=5. We introduce the notion of a total likelihood to estimate the relevance of the low-multipoles tests of non-gaussianity. We show that, as a result of these tests, the CMB maps which are most widely used for cosmological analysis lie within the ~ 10% of randomly generated maps with lowest likelihoods.
It is realized that existing powerful tests of goodness-of-fit are all based on sorted uniforms and, consequently, can suffer from the confounded effect of different locations and various signal frequencies in the deviations of the distributions under the alternative hypothesis from those under the null. This paper proposes circularly symmetric tests that are obtained by circularizing reweighted Anderson-Darling tests, with the focus on the circularized versions of Anderson-Darling and Zhang test statistics. Two specific types of circularization are considered, one is obtained by taking the average of the corresponding so-called scan test statistics and the other by using the maximum. To a certain extent, this circularization technique effectively eliminates the location effect and allows the weights to focus on the various signal frequencies. A limited but arguably convincing simulation study on finite-sample performance demonstrates that the circularized Zhang method outperforms the circularized Anderson-Darling and that the circularized tests outperform their parent methods. Large-sample theoretical results are also obtained for the average type of circularization. The results s
Kozen and Tiuryn have introduced the substructural logic $\mathsf{S}$ for reasoning about correctness of while programs (ACM TOCL, 2003). The logic $\mathsf{S}$ distinguishes between tests and partial correctness assertions, representing the latter by special implicational formulas. Kozen and Tiuryn's logic extends Kleene algebra with tests, where partial correctness assertions are represented by equations, not terms. Kleene algebra with codomain, $\mathsf{KAC}$, is a one-sorted alternative to Kleene algebra with tests that expands Kleene algebra with an operator that allows one to construct a Boolean subalgebra of tests. In this paper we show that Kozen and Tiuryn's logic embeds into the equational theory of the expansion of $\mathsf{KAC}$ with residuals of Kleene algebra multiplication and the upper adjoint of the codomain operator.
Guided by the Einstein equivalence principle that identifies the phenomenon of gravitation as a manifestation of the dynamics of spacetime in contrast to a localizable force, we review and explore its consequences on formulating a theory of gravity. The resulting space of metric theories of gravity may address open conceptual and observational puzzles through a wealth of effects beyond general relativity, whose traces can be searched for within today's and tomorrow's gravitational testing grounds. Above all, we offer a generic metric theory generalization of Isaacson's approach to the leading-order field equations of physical perturbations with a well-defined notion of energy-momentum carried by the gravitational waves. Within this framework, we identify the backreaction of the Isaacson energy-momentum flux onto the background spacetime with the displacement memory effect that induces a permanent distortion of space after the passage of a gravitational wave. This effect is a well-known prediction of GR whose dominant contribution captures its inherent non-linear nature, manifest in the ability of gravity to gravitate. However, the novel interpretation of memory as naturally arising
Solving intelligence tests, in particular numerical sequences, has been of great interest in the evaluation of AI systems. We present a new computational model called KitBit that uses a reduced set of algorithms and their combinations to build a predictive model that finds the underlying pattern in numerical sequences, such as those included in IQ tests and others of much greater complexity. We present the fundamentals of the model and its application in different cases. First, the system is tested on a set of number series used in IQ tests collected from various sources. Next, our model is successfully applied to the sequences used to evaluate the models reported in the literature. In both cases, the system is capable of solving these types of problems in less than a second using standard computing power. Finally, KitBit's algorithms have been applied for the first time to the complete set of integer sequences of the well-known OEIS database. We find a pattern in the form of a list of algorithms and predict the following terms in the largest number of series to date. These results demonstrate the potential of KitBit to solve complex problems that could be represented nume
Several methods in survival analysis are based on the proportional hazards assumption. However, this assumption is very restrictive and often not justifiable in practice. Therefore, effect estimands that do not rely on the proportional hazards assumption are highly desirable in practical applications. One popular example of this is the restricted mean survival time (RMST). It is defined as the area under the survival curve up to a prespecified time point and, thus, summarizes the survival curve into a meaningful estimand. For two-sample comparisons based on the RMST, previous research found that the asymptotic test has an inflated type I error for small samples and, therefore, a two-sample permutation test has already been developed. The first goal of the present paper is to further extend the permutation test for general factorial designs and general contrast hypotheses by considering a Wald-type test statistic and its asymptotic behavior. Additionally, a groupwise bootstrap approach is considered. Moreover, when a global test detects a significant difference by comparing the RMSTs of more than two groups, it is of interest which specific RMST differences cause the result. How
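The definition of the RMST as the area under the survival curve up to a prespecified time point can be sketched as follows. The sketch estimates the survival curve with the Kaplan-Meier estimator and integrates the resulting step function up to `tau`. The function names and the input format (parallel lists of times and event indicators, 1 for an observed event and 0 for censoring) are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    # Last rectangle, from the final jump before tau up to tau itself.
    area += prev_s * (tau - prev_t)
    return area


value = rmst([1, 2, 3, 4], [1, 1, 0, 1], tau=3)
```

For this toy sample the survival estimate is 1 on [0, 1), 0.75 on [1, 2) and 0.5 on [2, 3), so the RMST up to `tau=3` is 1 + 0.75 + 0.5 = 2.25.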
Most of the statistical tests currently used to detect differentially expressed genes are based on asymptotic results, and perform poorly for low expression tags. Another problem is the common use of a single canonical cutoff for the significance level (p-value) of all the tags, without taking into consideration the type II error and the highly variable character of the sample size of the tags. This work reports the development of two significance tests for the comparison of digital expression profiles, based on frequentist and Bayesian points of view, respectively. Both tests are exact, and do not use any asymptotic considerations, thus producing more correct results for low frequency tags than the chi-square test. The frequentist test uses a tag-customized critical level which minimizes a linear combination of type I and type II errors. A comparison of the Bayesian and the frequentist tests revealed that they are linked by a Beta distribution function. These tests can be used alone or in conjunction, and represent an improvement over the currently available methods for comparing digital profiles.
In 1859, Le Verrier discovered the Mercury perihelion advance anomaly. This anomaly turned out to be the first relativistic-gravity effect observed. During the 157 years to 2016, the precisions and accuracies of laboratory and space experiments, and of astrophysical and cosmological observations on relativistic gravity have been improved by 3-4 orders of magnitude. The improvements came mainly from optical observations at first, followed by radio observations. The achievements of the past 50 years are from radio Doppler tracking and radio ranging together with lunar laser ranging. At present, the radio observations and lunar laser ranging experiments are similar in the accuracy of testing relativistic gravity. We review and summarize the present status of solar-system tests of relativistic gravity. With planetary laser ranging, spacecraft laser ranging and interferometric laser ranging (laser Doppler ranging) together with the development of drag-free technology, the optical observations will improve the accuracies by another 3-4 orders of magnitude in both the equivalence principle tests and solar-system dynamics tests of relativistic gravity. Clock tests and atomic inter
The study of encryption deals with three kinds of algorithms. The first deals with the encryption mechanism, the second with the decryption mechanism, and the third with the generation of the keys and sub keys used in encryption. In this study, a new algorithm is discussed. The algorithm executes a series of steps and generates a sequence. This sequence is used as a sub key that is mapped to the plain text to generate the cipher text. The strength of the encryption and decryption process depends on the strength of the generated sequence against cryptanalysis. In this part of the work, statistical tests such as uniformity tests, universal tests, and repetition tests are applied to the generated sequence to test its strength.
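As an illustration of what a uniformity test on a generated sequence can look like, the sketch below computes a chi-square goodness-of-fit statistic against the uniform distribution over a given alphabet. This is a generic frequency test chosen for illustration; the specific tests used in the study are not described above, and the function name and inputs are assumptions.

```python
from collections import Counter


def chi_square_uniformity(seq, alphabet):
    """Chi-square statistic of observed symbol counts vs. a uniform model."""
    n = len(seq)
    expected = n / len(alphabet)
    counts = Counter(seq)
    return sum((counts.get(s, 0) - expected) ** 2 / expected for s in alphabet)


# For a binary alphabet the statistic has 1 degree of freedom;
# 3.841 is the standard 5% critical value for that case.
CRITICAL_5PCT_DF1 = 3.841

balanced = chi_square_uniformity("0110100110010110", "01")
skewed = chi_square_uniformity("0000000011", "01")
skewed_looks_uniform = skewed < CRITICAL_5PCT_DF1
```

A statistic below the critical value only means the symbol frequencies give no evidence against uniformity. It says nothing about ordering or repetition patterns, which is why frequency tests are usually combined with other tests.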
In this chapter, we discuss recent work on precision Earth laboratory tests of different aspects of gravity. In particular, the discussion is focused on those tests that can be used to probe hypotheses for physics beyond Newtonian gravity and General Relativity. The latter includes tests of foundations like local Lorentz invariance, Weak-Equivalence Principle tests, short-range gravity tests, gravimeter-type tests, and other frontier possibilities like the free-fall of anti-matter and searches for non-Riemann gravity effects. The focus is on key results in theory, phenomenology, and experiment in the last few decades. We describe the motivations for continued interest in precision tests of gravity in the laboratory, including the possibility to search for physics beyond General Relativity. Test frameworks for describing deviations from General Relativity are emphasized, including ones based on effective field theory, allowing for generic violations of Lorentz symmetry, CPT symmetry, and diffeomorphism symmetry.
While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an o
General Relativity (GR) remains the most accurate theory of gravity to date. It has passed many experimental tests in the Solar System as well as binary pulsar, cosmological and gravitational-wave (GW) observations. Some of these tests probe regimes where gravitational fields are weak, the spacetime curvature is small, and the characteristic velocities are not comparable to the speed of light. Observations of compact binary coalescences enable us to test GR in extreme environments of strong and dynamical gravitational fields, large spacetime curvature, and velocities comparable to the speed of light. Since the breakthrough observation of the first GW signal produced by the merger of two black holes, GW150914, in September 2015, the number of confirmed detections of binary mergers has rapidly increased to nearly 100. The analysis of these events has already placed significant constraints on possible deviations from GR and on the nature of the coalescing compact objects. In this chapter, we discuss a selection of tests of GR applicable to observations of GWs from compact binaries. In particular, we will cover consistency tests, which check for consistency between the different phases
The detections of gravitational-wave (GW) signals from compact binary coalescence by ground-based detectors have opened up the era of GW astronomy. These observations provide opportunities to test Einstein's general theory of relativity in the strong-field regime. Here we give a brief overview of the various GW-based tests of General Relativity (GR) performed by the LIGO-Virgo collaboration on the detected GW events to date. After providing details for the tests performed in four categories, we discuss the prospects for each test in the context of future GW detectors. The four categories of tests are consistency tests, parametrized tests for GW generation and propagation, tests for the merger remnant properties, and GW polarization tests.