共找到 20 条结果
The categorical Gini covariance is a dependence measure between a numerical variable and a categorical variable. The Gini covariance measures dependence by quantifying the difference between the conditional and unconditional distributional functions. The categorical Gini covariance equals zero if and only if the numerical variable and the categorical variable are independent. We propose a non-parametric test for testing the independence between a numerical and categorical variable using a modified categorical Gini covariance. We used the theory of U-statistics to find the test statistics and study the properties. The test has an asymptotic normal distribution. Since the implementation of a normal-based test is difficult, we develop a jackknife empirical likelihood (JEL) ratio test for testing independence. Extensive Monte Carlo simulation studies are carried out to validate the performance of the proposed JEL ratio test. We illustrate the test procedure using Iris flower data set.
Intelligent assistants powered by Large Language Models (LLMs) can generate program and test code with high accuracy, boosting developers' and testers' productivity. However, there is a lack of studies exploring LLMs for testing Web APIs, which constitute fundamental building blocks of modern software systems and pose significant test challenges. Hence, in this article, we introduce APITestGenie, an approach and tool that leverages LLMs to generate executable API test scripts from business requirements and API specifications. In experiments with 10 real-world APIs, the tool generated valid test scripts 57% of the time. With three generation attempts per task, this success rate increased to 80%. Human intervention is recommended to validate or refine generated scripts before integration into CI/CD pipelines, positioning our tool as a productivity assistant rather than a replacement for testers. Feedback from industry specialists indicated a strong interest in adopting our tool for improving the API test process.
This paper introduces a decision-theoretic framework for constructing and evaluating test statistics based on their relationship with ancillary statistics-quantities whose distributions remain fixed under the null and alternative hypotheses. Rather than focusing solely on maximizing discriminatory power, the proposed approach emphasizes reducing dependence between a test statistic and relevant ancillary structures. We show that minimizing such dependence can yield most powerful (MP) procedures. A Basu-type independence result is established, and we demonstrate that certain MP statistics also characterize the underlying data distribution. The methodology is illustrated through modifications of classical nonparametric tests, including the Shapiro-Wilk, Anderson-Darling, and Kolmogorov-Smirnov tests, as well as a test for the center of symmetry. Simulation studies highlight the power and robustness of the proposed procedures. The framework is computationally simple and offers a principled strategy for improving statistical testing.
The justification for the "test point" derives from the test pilot's obligation to reproduce faithfully the pre-specified conditions of some model prediction. Pilot deviation from those conditions invalidates the model assumptions. Flight test aids have been proposed to increase accuracy on more challenging test points. However, the very existence of databands and tolerances is the problem more fundamental than inadequate pilot skill. We propose a novel approach, which eliminates test points. We start with a high-fidelity digital model of an air vehicle. Instead of using this model to generate a point prediction, we use a machine learning method to produce a reduced-order model (ROM). The ROM has two important properties. First, it can generate a prediction based on any set of conditions the pilot flies. Second, if the test result at those conditions differ from the prediction, the ROM can be updated using the new data. The outcome of flight test is thus a refined ROM at whatever conditions were flown. This ROM in turn updates and validates the high-fidelity model. We present a single example of this "point-less" architecture, using T-38C flight test data. We first use a generic ai
In unsupervised combinatorial optimization (UCO), during training, one aims to have continuous decisions that are promising in a probabilistic sense for each training instance, which enables end-to-end training on initially discrete and non-differentiable problems. At the test time, for each test instance, starting from continuous decisions, derandomization is typically applied to obtain the final deterministic decisions. Researchers have developed more and more powerful test-time derandomization schemes to enhance the empirical performance and the theoretical guarantee of UCO methods. However, we notice a misalignment between training and testing in the existing UCO methods. Consequently, lower training losses do not necessarily entail better post-derandomization performance, even for the training instances without any data distribution shift. Empirically, we indeed observe such undesirable cases. We explore a preliminary idea to better align training and testing in UCO by including a differentiable version of derandomization into training. Our empirical exploration shows that such an idea indeed improves training-test alignment, but also introduces nontrivial challenges into trai
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysi
Log symmetric distributions are useful in modeling data which show high skewness and have found applications in various fields. Using a recent characterization for log symmetric distributions, we propose a goodness of fit test for testing log symmetry. The asymptotic distributions of the test statistics under both null and alternate distributions are obtained. As the normal-based test is difficult to implement, we also propose a jackknife empirical likelihood (JEL) ratio test for testing log symmetry. We conduct a Monte Carlo Simulation to evaluate the performance of the JEL ratio test. Finally, we illustrated our methodology using different data sets.
Machine learning methods strive to acquire a robust model during the training process that can effectively generalize to test samples, even in the presence of distribution shifts. However, these methods often suffer from performance degradation due to unknown test distributions. Test-time adaptation (TTA), an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions. Recent progress in this paradigm has highlighted the significant benefits of using unlabeled data to train self-adapted models prior to inference. In this survey, we categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation. For each category, we provide a comprehensive taxonomy of advanced algorithms and discuss various learning scenarios. Furthermore, we analyze relevant applications of TTA and discuss open challenges and promising areas for future research. For a comprehensive list of TTA methods, kindly refer to \url{https://github.com/tim-learn/awesome-test-time-adaptation}.
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
With the advent of WWW and outburst in technology and software development, testing the software became a major concern. Due to the importance of the testing phase in a software development life cycle, testing has been divided into graphical user interface (GUI) based testing, logical testing, integration testing, etc.GUI Testing has become very important as it provides more sophisticated way to interact with the software. The complexity of testing GUI increased over time. The testing needs to be performed in a way that it provides effectiveness, efficiency, increased fault detection rate and good path coverage. To cover all use cases and to provide testing for all possible (success/failure) scenarios the length of the test sequence is considered important. Intent of this paper is to study some techniques used for test case generation and process for various GUI based software applications.
Test-time adaptation (TTA) refers to adapting a trained model to a new domain during testing. Existing TTA techniques rely on having multiple test images from the same domain, yet this may be impractical in real-world applications such as medical imaging, where data acquisition is expensive and imaging conditions vary frequently. Here, we approach such a task, of adapting a medical image segmentation model with only a single unlabeled test image. Most TTA approaches, which directly minimize the entropy of predictions, fail to improve performance significantly in this setting, in which we also observe the choice of batch normalization (BN) layer statistics to be a highly important yet unstable factor due to only having a single test domain example. To overcome this, we propose to instead integrate over predictions made with various estimates of target domain statistics between the training and test statistics, weighted based on their entropy statistics. Our method, validated on 24 source/target domain splits across 3 medical image datasets surpasses the leading method by 2.9% Dice coefficient on average.
Current hardware security verification processes predominantly rely on manual threat modeling and test plan generation, which are labor-intensive, error-prone, and struggle to scale with increasing design complexity and evolving attack methodologies. To address these challenges, we propose ThreatLens, an LLM-driven multi-agent framework that automates security threat modeling and test plan generation for hardware security verification. ThreatLens integrates retrieval-augmented generation (RAG) to extract relevant security knowledge, LLM-powered reasoning for threat assessment, and interactive user feedback to ensure the generation of practical test plans. By automating these processes, the framework reduces the manual verification effort, enhances coverage, and ensures a structured, adaptable approach to security verification. We evaluated our framework on the NEORV32 SoC, demonstrating its capability to automate security verification through structured test plans and validating its effectiveness in real-world scenarios.
Frequent modifications of unit test cases are inevitable due to software's continuous underlying changes in source code, design, and requirements. Since manually maintaining software test suites is tedious, timely, and costly, automating the process of generation and maintenance of test units will significantly impact the effectiveness and efficiency of software testing processes. To this end, we propose an automated approach which exploits both structural and semantic properties of source code methods and test cases to recommend the most relevant and useful unit tests to the developers. The proposed approach initially trains a neural network to transform method-level source code, as well as unit tests, into distributed representations (embedded vectors) while preserving the importance of the structure in the code. Retrieving the semantic and structural properties of a given method, the approach computes cosine similarity between the method's embedding and the previously-embedded training instances. Further, according to the similarity scores between the embedding vectors, the model identifies the closest methods of embedding and the associated unit tests as the most similar recomm
Test flakiness is a problem that affects testing and processes that rely on it. Several factors cause or influence the flakiness of test outcomes. Test execution order, randomness and concurrency are some of the more common and well-studied causes. Some studies mention code instrumentation as a factor that causes or affects test flakiness. However, evidence for this issue is scarce. In this study, we attempt to systematically collect evidence for the effects of instrumentation on test flakiness. We experiment with common types of instrumentation for Java programs - namely, application performance monitoring, coverage and profiling instrumentation. We then study the effects of instrumentation on a set of nine programs obtained from an existing dataset used to study test flakiness, consisting of popular GitHub projects written in Java. We observe cases where real-world instrumentation causes flakiness in a program. However, this effect is rare. We also discuss a related issue - how instrumentation may interfere with flakiness detection and prevention.
Two new omnibus tests of uniformity for data on the hypersphere are proposed. The new test statistics exploit closed-form expressions for orthogonal polynomials, feature tuning parameters, and are related to a "smooth maximum" function and the Poisson kernel. We obtain exact moments of the test statistics under uniformity and rotationally symmetric alternatives, and give their null asymptotic distributions. We consider approximate oracle tuning parameters that maximize the power of the tests against known generic alternatives and provide tests that estimate oracle parameters through cross-validated procedures while maintaining the significance level. Numerical experiments explore the effectiveness of null asymptotic distributions and the accuracy of inexpensive approximations of exact null distributions. A simulation study compares the powers of the new tests with other tests of the Sobolev class, showing the benefits of the former. The proposed tests are applied to the study of the (seemingly uniform) nursing times of wild polar bears.
We suggest a dependence coefficient between a categorical variable and some general variable taking values in a metric space. We derive important theoretical properties and study the large sample behaviour of our suggested estimator. Moreover, we develop an independence test which has an asymptotic $χ^2$-distribution if the variables are independent and prove that this test is consistent against any violation of independence. The test is also applicable to the classical~$K$-sample problem with possibly high- or infinite-dimensional distributions. We discuss some extensions, including a variant of the coefficient for measuring conditional dependence.
Recent safety standards set stringent requirements for the target fault coverage in embedded microprocessors, with the objective to guarantee robustness and functional safety of the critical electronic systems. This motivates the need for improving the quality of test generation for microprocessors. A new high-level implementation-independent test generation method for RISC processors is proposed. The set of instructions of the processor is partitioned nto groups. For each group, a dedicated test template is created, to be used for generating two test programs, for testing the control and the data paths respectively. For testing the control part, a novel high-level control fault model is proposed. Using this model, a set of deterministic test data operands are generated for each instruction of the given group. The advantage of the high-level fault model is that it covers larger than SAF fault class including multiple fault coverage in the control part. For generating the data path test, pseudoexhaustive data operands are used. We investigated the feasibility of the approach and demonstrated high efficiency of the generated test programs for testing the execute module of the miniMIP
Multi-site testing is a popular and effective way to increase test throughput and reduce test costs. We present a test throughput model, in which we focus on wafer testing, and consider parameters like test time, index time, abort-on-fail, and contact yield. Conventional multi-site testing requires sufficient ATE resources, such as ATE channels, to allow to test multiple SOCs in parallel. In this paper, we design and optimize on-chip DfT, in order to maximize the test throughput for a given SOC and ATE. The on-chip DfT consists of an E-RPCT wrapper, and, for modular SOCs, module wrappers and TAMs. We present experimental results for a Philips SOC and several ITC'02 SOC Test Benchmarks.
On a commercial digital still camera (DSC) controller chip we practice a novel SOC test integration platform, solving real problems in test scheduling, test IO reduction, timing of functional test, scan IO sharing, embedded memory built-in self-test (BIST), etc. The chip has been fabricated and tested successfully by our approach. Test results justify that short test integration cost, short test time, and small area overhead can be achieved. To support SOC testing, a memory BIST compiler and an SOC testing integration system have been developed.
Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. **Our benchmark is visually centric, minimizing the dependence on unnecessary textual content**, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, wi