Maps of the cosmic microwave background (CMB) are extracted from multi-frequency observations using a variety of cleaning procedures. However, in regions of strong microwave emission, particularly the galactic plane of the Milky Way and some extended or point sources, the recovered CMB signal is not reliable. A galactic mask is therefore provided along with the \emph{cleaned} CMB sky to excise regions that may still be contaminated even after cleaning, and we impose such a foreground mask to avoid biasing our inferences. In this paper, we analyze a cleaned CMB map from Planck PR4 to probe for any foreground residuals that may still be present \emph{outside} the galactic mask, where the derived CMB sky is considered clean. To that end, we employ a local cross-correlation coefficient statistic in which we cross-correlate widely used foreground templates that trace galactic synchrotron, free-free, and thermal dust emission with the cleaned CMB sky. Using simulations, we find that a few regions of the derived CMB sky are still contaminated and have to be omitted. Based on this study, we derived a mask that could be used i
Studies of the cosmic microwave background (CMB) are often limited by foreground contamination. Foreground cleaning is performed either in harmonic or pixel space after data cuts have excluded sky areas of strong contamination. We present a nearly full-sky CMB temperature map with only 1% of pixels masked. To derive this map, we make use of six full-sky template maps at foreground-dominated frequencies from different experiments, smoothed to $1^\circ$, and rely on the combination of these weighted maps to trace the morphology of foreground contamination. We do not impose any spectral index constraints, but only fit for template amplitudes at each target frequency. We clean WMAP and Planck maps at a set of target frequencies and conduct quality tests at the level of the maps, pixel histograms and power spectra to select four cleaned CMB maps that show negligible foreground contamination, with only 1% of pixels masked and no inpainting. We recommend use of these cleaned CMB maps for low multipole ($\ell < 30$) studies.
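The amplitude-only template fit described in this abstract can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit over unmasked pixels, which is not necessarily the estimator the authors use; the function names `fit_template_amplitudes` and `clean_map`, the synthetic Gaussian "maps", and the random 1% mask are all hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_template_amplitudes(target_map, templates, mask=None):
    """Least-squares amplitudes a such that target ~ cmb + a @ templates.

    templates has shape (n_templates, n_pix); mask is True for pixels to use.
    No spectral-index constraint is imposed: each amplitude is a free parameter.
    """
    target_map = np.asarray(target_map, dtype=float)
    templates = np.asarray(templates, dtype=float)
    if mask is None:
        mask = np.ones(target_map.shape, dtype=bool)
    # Subtract the mean over unmasked pixels so a constant offset does not drive the fit.
    t = templates[:, mask] - templates[:, mask].mean(axis=1, keepdims=True)
    d = target_map[mask] - target_map[mask].mean()
    amplitudes, *_ = np.linalg.lstsq(t.T, d, rcond=None)
    return amplitudes

def clean_map(target_map, templates, amplitudes):
    """Subtract the weighted templates from the target-frequency map."""
    return np.asarray(target_map, dtype=float) - amplitudes @ np.asarray(templates, dtype=float)

# Synthetic demonstration (assumed data, not real sky maps).
rng = np.random.default_rng(0)
n_pix = 20_000
templates = rng.normal(size=(6, n_pix))      # six foreground templates
cmb = rng.normal(size=n_pix)                 # stand-in for the CMB signal
true_amps = np.array([0.5, -0.2, 1.0, 0.3, 0.0, 0.8])
target = cmb + true_amps @ templates         # target-frequency map
mask = rng.random(n_pix) > 0.01              # keep roughly 99% of pixels

amps = fit_template_amplitudes(target, templates, mask)
cleaned = clean_map(target, templates, amps)
```

Because the synthetic CMB is statistically independent of the templates, the fitted amplitudes land close to `true_amps`; on real data, chance correlations between the CMB and the foreground templates would bias such a fit and need separate treatment.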
This paper presents an improved LLM-based model for Grammatical Error Detection (GED), which is a very challenging and equally important problem for many applications. The traditional approach to GED involved hand-designed features, but recently, Neural Networks (NN) have automated the discovery of these features, improving performance in GED. Traditional rule-based systems have an F1 score of 0.50-0.60, and earlier machine learning models, including decision trees and simple neural networks, give an F1 score of 0.65-0.75. Previous deep learning models, for example, Bi-LSTM, have reported F1 scores in the range from 0.80 to 0.90. In our study, we have fine-tuned various transformer models using the Lang8 dataset, which we rigorously cleaned. In our experiments, the BERT-base-uncased model gave an impressive performance with an F1 score of 0.91 and accuracy of 98.49% on training data and 90.53% on testing data, also showcasing the importance of data cleaning. Increasing model size using BERT-large-uncased or RoBERTa-large did not give any noticeable improvements in performance or advantage for this task, underscoring that larger models are not always better. Our results clearly show h
The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual data
In this paper we present a new method to estimate a foreground cleaned Cosmic Microwave Background (CMB) map at a resolution of $1^\circ$ by minimizing the non-Gaussian properties of the cleaned map which arise dominantly due to diffuse foreground emission components from the Milky Way. We employ a simple kurtosis statistic as the measure of non-Gaussian properties and perform a linear combination of 5 frequency maps provided by the Wilkinson Microwave Anisotropy Probe (WMAP) in its 7-year data release in such a way that the cleaned map has a minimum kurtosis, which leads to a non-Gaussianity minimized, foreground cleaned CMB map. We validate the method by performing Monte-Carlo simulations. To minimize any residual foreground contamination in the cleaned map we flag out a region near the galactic plane based upon results from simulations. Outside the masked region our new estimate of the CMB map matches well with WMAP's ILC map. A simple pseudo-$C_l$ based CMB TT power spectrum derived from the non-Gaussianity minimized map reproduces the earlier results of WMAP's power spectrum. {\it An important advantage of the method is that it does not introduce any negative bias in angular power sp
We perform an independent foreground analysis of the WMAP maps to produce a cleaned CMB map (available online) useful for cross-correlation with, e.g., galaxy and X-ray maps. We use a variant of the Tegmark & Efstathiou (1996) technique that is completely blind, making no assumptions about the CMB power spectrum, the foregrounds, WMAP detector noise or external templates. Compared with the foreground-cleaned internal linear combination map produced by the WMAP team, our map has the advantage of containing less non-CMB power (from foregrounds and detector noise) outside the Galactic plane. The difference is most important on the angular scale of the first acoustic peak and below, since our cleaned map is at the highest (13') rather than lowest (49') WMAP resolution. We also produce a Wiener filtered CMB map, representing our best guess as to what the CMB sky actually looks like, as well as CMB-free maps at the five WMAP frequencies useful for foreground studies. We argue that our CMB map is clean enough that the lowest multipoles can be measured without any galaxy cut, and obtain a quadrupole value that is slightly less low than that from the cut-sky WMAP team analysis. This
We test for foreground residuals in the foreground cleaned Planck Cosmic Microwave Background (CMB) maps outside and inside the U73 mask commonly used for cosmological analysis. The aim of this paper is to introduce a new method to validate masks by looking at the differences in cleaned maps obtained by different component separation methods. By analyzing the power spectrum as well as the mean, variance and skewness of needlet coefficients on bands outside and inside the U73 mask, we first confirm that the pixels already masked by U73 are highly contaminated and cannot be used for cosmological analysis. We further find that the U73 mask needs extension in order to reduce large scale foreground residuals to a level of less than $20\%$ of the standard deviation of CMB fluctuations within the bands closest to the galactic equator. We also find 276 point sources in the cleaned foreground maps which are currently not masked by the U73 mask. Our final publicly available extended mask leaves $65.9\%$ of the sky for cosmological analysis. Note that this extended mask may be important for analyses on local sky patches; in full sky analyses the additional residuals near the galactic equator may a
Recently a symmetry-based method to test for statistical isotropy of the cosmic microwave background was developed. We apply the method to template-cleaned 3-year and 5-year WMAP-$DA$ maps. We examine a wide range of angular multipoles, $2 < l < 300$. The analysis detects statistically significant signals of anisotropy inconsistent with an isotropic CMB in some of the foreground cleaned maps. We are unable to resolve whether the anomalies have a cosmological, local astrophysical or instrumental origin. Assuming the anisotropy arises due to residual foreground contamination, we estimate the residual foreground power in the maps. For the W band maps, we also find a highly improbable degree of isotropy that we cannot explain. We speculate that excess isotropy may be caused by faulty modeling of detector noise.
The surface of ultra-thin materials plays a crucial role in determining their properties. This is particularly important in two-dimensional (2D) materials where the surface-bulk distinction is no longer present. While mechanical cleaning of two-dimensional materials to remove interfacial and surface contaminants is used to achieve better sample quality, low throughput and the challenging optimization of cleaning procedures hinder its widespread adoption. Here, we report on atomic force microscope (AFM)-based mechanical cleaning with modified AFM cantilevers for high-throughput and easy-to-implement cleaning of 2D materials and their heterostructures. A Pt-wedge is deposited via focused ion beam on the cantilever to improve the mechanical cleaning of samples and streamline the cleaning procedures. We demonstrate that a cleaning rate of 3 μm^2/s can be achieved with our modified cantilevers, compared to the 0.01 μm^2/s effective cleaning rate in pointy-tip cleaning. As showcases, we demonstrate that monolayer WS2 on h-BN exhibits much sharper photoluminescence (PL) emission at room temperature after AFM cleaning, and WS2 monolayers exhibit higher-quality contacts to cleaned Au electr
Pull Requests (PRs) are central to collaborative coding, summarizing code changes for reviewers. However, many PR descriptions are incomplete, uninformative, or have out-of-context content, compromising developer workflows and hindering AI-based generation models trained on commit messages and original descriptions as "ground truth." This study examines the prevalence of "noisy" PRs and evaluates their impact on state-of-the-art description generation models. To do so, we propose four cleaning heuristics to filter noise from an initial dataset of 169K+ PRs drawn from 513 GitHub repositories. We train four models (BART, T5, PRSummarizer, and iTAPE) on both raw and cleaned datasets. Performance is measured via ROUGE-1, ROUGE-2, and ROUGE-L metrics, alongside a manual evaluation to assess description quality improvements from a human perspective. Cleaning the dataset yields significant gains: average F1 improvements of 8.6% (ROUGE-1), 8.7% (ROUGE-2), and 8.5% (ROUGE-L). Manual assessment confirms higher readability and relevance in descriptions generated by the best-performing model, BART, when trained on cleaned data. Dataset refinement markedly enhances PR description generation, offer
Streaming data can arise from a variety of contexts. Important use cases are continuous sensor measurements such as temperature, light or radiation values. In the process, streaming data may also contain data errors that should be cleaned before further use. Many studies from science and practice focus on data cleaning in a static context. However, in terms of data cleaning, streaming data has particularities that distinguish it from static data. In this paper, we have therefore undertaken an intensive exploration of data cleaning of data streams. We provide a detailed analysis of the applicability of data cleaning to data streams. Our theoretical considerations are evaluated in comprehensive experiments. Using a prototype framework, we show that cleaning is not consistent when working with data streams. An additional contribution is the investigation of requirements for streaming technologies in context of data cleaning.
The mathematics of redistricting is an area of study that has exploded in recent years. In particular, many different research groups and expert witnesses in court cases have used outlier analysis to argue that a proposed map is a gerrymander. This outlier analysis relies on having an ensemble of potential redistricting maps against which the proposed map is compared. Arguably the most widely-accepted method of creating such an ensemble is to use a Markov Chain Monte Carlo (MCMC) process. This process requires that various pieces of data be gathered, cleaned, and coalesced into a single file that can be used as the seed of the MCMC process. In this article, we describe how we have begun this cleaning process for each state, and made the resulting data available for the public at https://github.com/eveomett-states . At the time of submission, we have data for 22 states available for researchers, students, and the general public to easily access and analyze. We will continue the data cleaning process for each state, and we hope that the availability of these datasets will both further research in this area, and increase the public's interest in and understanding of modern techniques
The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typically unknown, cleaning methods targeting specific corruptions are often impractical. This paper proposes and evaluates two distinct, noise-agnostic data cleaning methods to address this challenge. The first approach uses data attribution via unlearning to identify and filter out training samples that contribute the least to producing clean outputs. The second leverages the Fréchet Audio Distance to measure and remove samples that are perceptually dissimilar to a small and trusted clean reference set. On a dataset contaminated with a simulated distribution of real-world noise, our unlearning-based methods produced a cleaned dataset and a corresponding model that outperforms both the original contaminated data and the small clean reference set used for cleaning. This result closes approximately 66.7\% of the performance gap between the contaminated baseline and a model trained on the same dataset without any contami
A ring is called clean if every element is the sum of an invertible element and an idempotent. This paper investigates the cleanness of AW*-algebras. We prove that all finite AW*-algebras are clean, affirmatively solving a question posed by Vas. We also prove that all countably decomposable infinite AW*-factors are clean. A *-ring is called almost *-clean if every element can be expressed as the sum of a non-zero-divisor and a projection. We show that an AW*-algebra is almost *-clean if and only if it is finite.
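The definition of a clean ring can be checked by brute force in a small commutative example. The sketch below uses the rings $\mathbb{Z}/n\mathbb{Z}$ purely as an illustration of the definition; it says nothing about AW*-algebras, and the helper names (`units`, `idempotents`, `clean_decompositions`, `is_clean`) are hypothetical.

```python
from math import gcd

def units(n):
    """Invertible elements of Z/nZ (n >= 2)."""
    return [u for u in range(n) if gcd(u, n) == 1]

def idempotents(n):
    """Elements e of Z/nZ with e*e = e."""
    return [e for e in range(n) if (e * e) % n == e]

def clean_decompositions(n):
    """Map each a in Z/nZ to all pairs (unit, idempotent) with a = u + e mod n."""
    decomps = {a: [] for a in range(n)}
    for u in units(n):
        for e in idempotents(n):
            decomps[(u + e) % n].append((u, e))
    return decomps

def is_clean(n):
    """True if every element of Z/nZ has at least one clean decomposition."""
    return all(clean_decompositions(n).values())

# In Z/6Z the units are 1, 5 and the idempotents are 0, 1, 3, 4.
print(clean_decompositions(6))
print(is_clean(6))
```

In $\mathbb{Z}/6\mathbb{Z}$, for instance, $0 = 5 + 1$ and $3 = 5 + 4$, so even the zero element and the idempotent $3$ are sums of a unit and an idempotent.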
CLEAN is a well-established deconvolution approach to Fourier imaging at both radio wavelengths and hard X-ray energies. However, specifically for hard X-ray imaging, CLEAN suffers from two significant drawbacks: a rather limited degree of automation, and a tendency to under-resolution. This paper introduces a multi-scale version of CLEAN specifically tailored to the reconstruction of images from measurements observed by the Spectrometer/Telescope for Imaging X-rays (STIX) on-board Solar Orbiter. Using synthetic STIX data, this study shows that multi-scale CLEAN may represent a reliable solution to the two previously mentioned CLEAN limitations. Further, this paper shows the performance of CLEAN and its multi-scale version in reconstructing real experimental scenarios characterized by complex emission morphologies.
Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, and that it alters similarity rankings of cleaned abstracts and improves the information content of standard-length embeddings.
We define the class of {\it unit uniquely clean} rings ({\it UnitUC} for short), which is a common generalization of uniquely clean rings and strongly nil clean rings. Abelian {\it UnitUC} rings are uniquely clean and {\it UnitUC} rings with nil Jacobson radical are strongly nil clean. These rings also generalize the UUC and CUC rings, defined by Calugareanu-Zhou in Mediterranean J. Math. (2023), which are rings whose clean elements are uniquely clean. These rings also represent a natural generalization of the Boolean rings in that a ring is {\it UnitUC} if, and only if, it is exchange and Boolean modulo the Jacobson radical. The behavior of {\it UnitUC} rings under group ring and matrix ring extensions is investigated. Several examples are provided to explain and delimit the results.
Radio interferometric imaging has long relied on the CLEAN algorithm, valued for its speed, robustness, and integration with calibration pipelines. However, next-generation facilities such as the ngVLA, SKA, and ALMA's Wideband Sensitivity Upgrade will produce data volumes and dynamic ranges that exceed the scalability of traditional methods. CLEAN remains dominant due to its simplicity and accumulated expertise, yet its assumption of modeling the sky as point sources limits its ability to recover extended emission and hampers automation. We review CLEAN's limitations and survey alternatives, including multiscale extensions, compressive sensing, Regularized Maximum Likelihood, Bayesian inference, and AI-driven approaches. Forward-modeling methods enable higher fidelity, flexible priors, and uncertainty quantification, albeit at greater computational cost. Hybrid approaches such as Autocorr-CLEAN, CG-CLEAN, and PolyCLEAN retain CLEAN's workflow while incorporating modern optimization. We argue hybrids are best suited for the near term, while Bayesian and AI-based frameworks represent the long-term future of interferometric imaging.
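The point-source assumption discussed in this abstract can be made concrete with a minimal Högbom-style CLEAN loop. This is a sketch of the classic algorithm only, not of any of the multiscale or hybrid variants named in the abstract; the function name `hogbom_clean`, the periodic boundary handling, the Gaussian stand-in for the dirty beam, and the default `gain` and `threshold` values are illustrative assumptions.

```python
import numpy as np

def hogbom_clean(dirty, psf, gain=0.1, threshold=1e-3, max_iter=10_000):
    """Minimal Hogbom-style CLEAN on a 2D image with periodic boundaries.

    psf must have the same shape as dirty, with its peak (normalized to 1)
    at pixel (0, 0). Returns the point-source model and the final residual.
    """
    residual = np.array(dirty, dtype=float)
    model = np.zeros_like(residual)
    for _ in range(max_iter):
        # Locate the brightest residual pixel.
        peak = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        peak_value = residual[peak]
        if abs(peak_value) < threshold:
            break
        # Record a fraction of the peak as a point source,
        # then subtract that source's PSF response from the residual.
        model[peak] += gain * peak_value
        residual -= gain * peak_value * np.roll(psf, shift=peak, axis=(0, 1))
    return model, residual

# Synthetic demonstration: two point sources seen through a Gaussian PSF.
n = 64
idx = np.minimum(np.arange(n), n - np.arange(n))   # periodic distance from pixel 0
psf = np.exp(-(idx[:, None] ** 2 + idx[None, :] ** 2) / (2 * 2.0 ** 2))
dirty = (2.0 * np.roll(psf, (10, 12), axis=(0, 1))
         + 1.0 * np.roll(psf, (40, 45), axis=(0, 1)))

model, residual = hogbom_clean(dirty, psf)
```

The model is a set of delta functions by construction, which works well for the two isolated sources here but is exactly why extended emission is represented poorly: a smooth structure has to be built up from many individual point components.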
In this paper, we study a new class of rings, called $\sqrt{J}$-clean rings. A ring in which every element can be expressed as the sum of an idempotent and an element from $\sqrt{J(R)}$ is called a $\sqrt{J}$-clean ring. Here, $\sqrt{J(R)}=\{ z\in R : z^n\in J(R) \ \mathrm{for \ some} \ n \geq 1 \}$, where $J(R)$ is the Jacobson radical. We provide the basic properties of $\sqrt{J}$-clean rings. We also show that the classes of semiboolean and nil clean rings are proper subclasses of the class of $\sqrt{J}$-clean rings, which itself is a proper subclass of clean rings. We give a characterization of $\sqrt{J}$-clean rings: a ring $R$ is a $\sqrt{J}$-clean ring iff $R/J(R)$ is a $\sqrt{J}$-clean ring and idempotents lift modulo $J(R)$. We also prove that a ring is a uniquely clean ring if and only if it is a uniquely $\sqrt{J}$-clean ring. Finally, several matrix extensions like $T_n(R)$ and $D_n(R)$ over a $\sqrt{J}$-clean ring are explored.
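The definition of a $\sqrt{J}$-clean ring can be explored by brute force in the finite commutative rings $\mathbb{Z}/n\mathbb{Z}$. The sketch below is only an illustration of the definition on that assumed family of examples; the function names are hypothetical, and the Jacobson radical is computed from the standard characterization that $x \in J(R)$ exactly when $1 - rx$ is a unit for every $r$.

```python
from math import gcd

def jacobson_radical(n):
    """J(Z/nZ): all x such that 1 - r*x is a unit for every r (n >= 2)."""
    return {x for x in range(n)
            if all(gcd((1 - r * x) % n, n) == 1 for r in range(n))}

def sqrt_jacobson(n):
    """sqrt(J) = {z : z^k in J for some k >= 1}.

    Z/nZ has n elements, so the powers z^1, ..., z^n already contain
    every distinct power of z and exponents up to n suffice.
    """
    J = jacobson_radical(n)
    return {z for z in range(n)
            if any(pow(z, k, n) in J for k in range(1, n + 1))}

def is_sqrt_j_clean(n):
    """True if every a in Z/nZ is an idempotent plus an element of sqrt(J)."""
    idem = [e for e in range(n) if (e * e) % n == e]
    root = sqrt_jacobson(n)
    return all(any((a - e) % n in root for e in idem) for a in range(n))

print(jacobson_radical(12))
print([n for n in range(2, 20) if is_sqrt_j_clean(n)])
```

For example, $\mathbb{Z}/4\mathbb{Z}$ is $\sqrt{J}$-clean because $J = \{0, 2\}$ and the idempotents are $0$ and $1$, whereas in $\mathbb{Z}/3\mathbb{Z}$ the element $2$ has no such decomposition, since $J = \{0\}$ and $2$ is not idempotent. Both rings are clean, which is consistent with the proper-subclass statement in the abstract.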
We report on a comprehensive experimental investigation into the spatial-spectral complexity of the laser beam during Kerr-induced beam self-cleaning in graded-index multimode fibers. We demonstrate the self-cleaning of beams using both transform-limited and chirped femtosecond pulses. By utilizing the spectrally resolved imaging technique, we examine variations in beam homogeneity during the beam cleanup process and reveal correlations observed among spatial beam profiles at different wavelengths for the various cleaned pulses. Our results significantly advance our understanding of Kerr-induced self-cleaning with chirped ultrafast pulses and offer new possibilities for diverse applications.