This research seeks to benefit the software engineering society by proposing comparative separation, a novel group fairness notion to evaluate the fairness of machine learning software on comparative judgment test data. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. It is the responsibility of all software developers to make their software accountable by ensuring that the machine learning software do not perform differently on different sensitive groups -- satisfying the separation criterion. However, evaluation of separation requires ground truth labels for each test data point. This motivates our work on analyzing whether separation can be evaluated on comparative judgment test data. Instead of asking humans to provide the ratings or categorical labels on each test data point, comparative judgments are made between pairs of data points such as A is better than B. According to the law of comparative judgment, providing such comparative judgments yields a lower cognitive burden for humans than providing ratings or categorical labels. This work first defines the novel fairness notion c
In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that bo
Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: https://brian90709.github.io/diff-reflection-separation/
We identify a strong structural obstruction to Uniform Separation in constructive arithmetic. The mechanism is independent of semantic content; it emerges whenever two distinct evaluator predicates are sustained in parallel and inference remains uniformly representable in an extension of HA. Under these conditions, any putative Uniform Class Separation principle becomes a distinguished instance of a fixed point construction. The resulting limitation is stricter in scope than classical separation barriers (Baker; Rudich; Aaronson et al.) insofar as it constrains the logical form of uniform separation within HA, rather than limiting particular relativizing, naturalizing, or algebrizing techniques.
We study a coupled driven system where two different species of particles, along with some vacancies or holes, move on a landscape whose shape fluctuates with time. The movement of the particles is guided by the local shape of the landscape, and this shape is also affected by the presence of different particle species. When a particle species push the landscape in the same (opposite) direction of its own motion, it is called an aligned (a reverse) bias. Aligned bias promotes ordering while reverse bias destroys it. In absence of vacancies, the system reduces to previously studied LH model with different kinds of ordered and disordered phases which could be explained as a competition or cooperation between aligned bias and reverse bias. This interplay is expected to remain unaffected even when vacancies are present since vacancies do not impart any kind of bias on the landscape. However, we find presence of vacancies effectively weakens the reverse bias and this significantly changes the outcome of the competition between the two bias types. As a result novel ordered phases emerge which were not seen before. We analytically calculate the new phase boundaries within mean field approx
The problem of speech separation, also known as the cocktail party problem, refers to the task of isolating a single speech signal from a mixture of speech signals. Previous work on source separation derived an upper bound for the source separation task in the domain of human speech. This bound is derived for deterministic models. Recent advancements in generative models challenge this bound. We show how the upper bound can be generalized to the case of random generative models. Applying a diffusion model Vocoder that was pretrained to model single-speaker voices on the output of a deterministic separation model leads to state-of-the-art separation results. It is shown that this requires one to combine the output of the separation model with that of the diffusion model. In our method, a linear combination is performed, in the frequency domain, using weights that are inferred by a learned model. We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks. In particular, for two speakers, our method is able to surpass what was previously considered the upper performance bound.
This paper introduces a dynamic logic extension of separation logic. The assertion language of separation logic is extended with modalities for the five types of the basic instructions of separation logic: simple assignment, look-up, mutation, allocation, and de-allocation. The main novelty of the resulting dynamic logic is that it allows to combine different approaches to resolving these modalities. One such approach is based on the standard weakest precondition calculus of separation logic. The other approach introduced in this paper provides a novel alternative formalization in the proposed dynamic logic extension of separation logic. The soundness and completeness of this axiomatization has been formalized in the Coq theorem prover.
While exponential separations are known between quantum and randomized communication complexity for partial functions (Raz, STOC 1999), the best known separation between these measures for a total function is quadratic, witnessed by the disjointness function. We give the first super-quadratic separation between quantum and randomized communication complexity for a total function, giving an example exhibiting a power 2.5 gap. We further present a 1.5 power separation between exact quantum and randomized communication complexity, improving on the previous ~1.15 separation by Ambainis (STOC 2013). Finally, we present a nearly optimal quadratic separation between randomized communication complexity and the logarithm of the partition number, improving upon the previous best power 1.5 separation due to Göös, Jayram, Pitassi, and Watson. Our results are the communication analogues of separations in query complexity proved using the recent cheat sheet framework of Aaronson, Ben-David, and Kothari (STOC 2016). Our main technical results are randomized communication and information complexity lower bounds for a family of functions, called lookup functions, that generalize and port the cheat
Incorrectness Separation Logic (ISL) is a proof system that is tailored specifically to resolve problems of under-approximation in programs that manipulate heaps, and it primarily focuses on bug detection. This approach is different from the over-approximation methods that are used in traditional logics such as Hoare Logic or Separation Logic. Although the soundness of ISL has been established, its completeness remains unproven. In this study, we establish relative completeness by leveraging the expressiveness of the weakest postconditions; expressiveness is a factor that is critical to demonstrating relative completeness in Reverse Hoare Logic. In our ISL framework, we allow for infinite disjunctions in disjunctive normal forms, where each clause comprises finite symbolic heaps with existential quantifiers. To compute the weakest postconditions in ISL, we introduce a canonicalization that includes variable aliasing.
Speech separation has been studied widely for single-channel close-talk microphone recordings over the past few years; developed solutions are mostly in frequency-domain. Recently, a raw audio waveform separation network (TasNet) is introduced for single-channel data, with achieving high Si-SNR (scale-invariant source-to-noise ratio) and SDR (source-to-distortion ratio) comparing against the state-of-the-art solution in frequency-domain. In this study, we incorporate effective components of the TasNet into a frequency-domain separation method. We compare both for alternative scenarios. We introduce a solution for directly optimizing the separation criterion in frequency-domain networks. In addition to speech separation objective and subjective measurements, we evaluate the separation performance on a speech recognition task as well. We study the speech separation problem for far-field data (more similar to naturalistic audio streams) and develop multi-channel solutions for both frequency and time-domain separators with utilizing spectral, spatial and speaker location information. For our experiments, we simulated multi-channel spatialized reverberate WSJ0-2mix dataset. Our experime
Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth reliance is problematic, because the ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from the ones in simulated datasets. Therefore, the performance usually degrades significantly when applying the supervised speech separation models to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, to exploit the real-world unlabeled mixtures for improving cross-domain unsupervised speech separation in an iterative manner, by leveraging upon the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework using two heterogeneous neural networks (HNNs) to produce high confidence pseudo labels of unlabeled real speech mixtures. These labels are then updated, and used to refine the HNNs to produce more reliable c
We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
Multiparty session types (MP) are a type discipline for enforcing the structured, deadlock-free communication of concurrent and message-passing programs. Traditional MP have a limited form of choice in which alternative communication possibilities are offered by a single participant and selected by another. Mixed choice multiparty session types (MCMP) extend the choice construct to include both selections and offers in the same choice. This paper first proposes a general typing system for a mixed choice synchronous multiparty session calculus, and prove type soundness, communication safety, and deadlock-freedom. Next we compare expressiveness of nine subcalcli of MCMP-calculus by examining their encodability (there exists a good encoding from one to another) and separation (there exists no good encoding from one calculus to another). We prove 8 new encodablity results and 20 new separation results. In summary, MCMP is strictly more expressive than classical multiparty sessions (MP) and mixed choice in mixed sessions. This contrasts earlier results where mixed sessions do not add any expressiveness to non-mixed fundamental sessions, shedding a light on expressiveness of multiparty m
Sound over-approximation methods have been proved effective for guaranteeing the absence of errors, but inevitably they produce false alarms that can hamper the programmers. Conversely, under-approximation methods are aimed at bug finding and are free from false alarms. We introduce Sufficient Incorrectness Logic~(SIL), a new under-approximating, triple-based program logic to reason about program errors. SIL is designed to set apart the initial states leading to errors. We prove that SIL is correct and complete for a minimal set of rules, and we study additional rules that can facilitate program analyses. We formally compare SIL to existing triple-based program logics. Incorrectness Logic and SIL both perform under-approximations, but while the former exposes only true errors, the latter locates the set of initial states that lead to such errors. Hoare Logic performs over-approximations and as such cannot capture the set of initial states leading to errors in nondeterministic programs -- for deterministic and terminating programs, Hoare Logic and SIL coincide. Finally, we instantiate SIL with Separation Logic formulae (Separation SIL) to handle pointers and dynamic allocation and w
Modules in all existing speech separation networks can be categorized into single-input-multi-output (SIMO) modules and single-input-single-output (SISO) modules. SIMO modules generate more outputs than input, and SISO modules keep the numbers of input and output the same. While the majority of separation models only contain SIMO architectures, it has also been shown that certain two-stage separation systems integrated with a post-enhancement SISO module can improve the separation quality. Why performance improvements can be achieved by incorporating the SISO modules? Are SIMO modules always necessary? In this paper, we empirically examine those questions by designing models with varying configurations in the SIMO and SISO modules. We show that comparing with the standard SIMO-only design, a mixed SIMO-SISO design with a same model size is able to improve the separation performance especially under low-overlap conditions. We further validate the necessity of SIMO modules and show that SISO-only models are still able to perform separation without sacrificing the performance. The observations allow us to rethink the model design paradigm and present different views on how the separat
We answer two questions regarding the sum-of-squares for the SYK model left open in Ref. 1, both of which are related to graphs. First (a "limitation"), we show that a fragment of the sum-of-squares, in which one considers commutation relations of degree-$4$ Majorana operators but does not impose any other relations on them, does not give the correct order of magnitude bound on the ground state energy. Second (a "separation"), we show that the graph invariant $Ψ(G)$ defined in Ref. 1 may be strictly larger than the independence number $α(G)$. The invariant $Ψ(G)$ is a bound on the norm of a Hamiltonian whose terms obey commutation relations determined by the graph $G$, and it was shown that $α(G)\leq Ψ(G) \leq \vartheta(G)$, where $\vartheta(\cdot)$ is the Lovasz theta function. We briefly discuss the case of $q eq 4$ in the SYK model. Separately, we define a problem that we call the quantum knapsack problem.
Two of the fundamental no-go theorems of quantum information are the no-cloning theorem (that it is impossible to make copies of general quantum states) and the no-teleportation theorem (the prohibition on telegraphing, or sending quantum states over classical channels without pre-shared entanglement). They are known to be equivalent, in the sense that a collection of quantum states is telegraphable if and only if it is clonable. Our main result suggests that this is not the case when computational efficiency is considered. We give a collection of quantum states and quantum oracles relative to which these states are efficiently clonable but not efficiently telegraphable. Given that the opposite scenario is impossible (states that can be telegraphed can always trivially be cloned), this gives the most complete quantum oracle separation possible between these two important no-go properties. We additionally study the complexity class clonableQMA, a subset of QMA whose witnesses are efficiently clonable. As a consequence of our main result, we give a quantum oracle separation between clonableQMA and the class QCMA, whose witnesses are restricted to classical strings. We also propose a
Ambisonics is a scene-based spatial audio format that has several useful features compared to object-based formats, such as efficient whole scene rotation and versatility. However, it does not provide direct access to the individual source signals, so that these have to be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this paper, we explore deep-learning-based source separation on static Ambisonics mixtures. In contrast to most source separation approaches, which separate a fixed number of sources of specific sound types, we focus on separating arbitrary sound from specific directions. Specifically, we propose three operating modes that combine a source separation neural network with SH beamforming: refinement, implicit, and mixed mode. We show that a neural network can implicitly associate conditioning directions with the spatial information contained in the Ambisonics scene to extract specific sources. We evaluate the performance of the three proposed approaches and compare them to SH beamforming on musical mixtures generated with the musdb18 dataset, as well as with mixtures generated with the FUSS dataset
Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked the first place in the MSS track of Sound Demixing Challenge (SDX23). Benchmarking a smaller version of BS-RoFormer on MUSDB18HQ, we achieve state-of-the-art result without extra training data, with 9.80 dB of average SDR.
Robustness is a fundamental pillar of Machine Learning (ML) classifiers, substantially determining their reliability. Methods for assessing classifier robustness are therefore essential. In this work, we address the challenge of evaluating corruption robustness in a way that allows comparability and interpretability on a given dataset. We propose a test data augmentation method that uses a robustness distance $ε$ derived from the datasets minimal class separation distance. The resulting MSCR (minimal separation corruption robustness) metric allows a dataset-specific comparison of different classifiers with respect to their corruption robustness. The MSCR value is interpretable, as it represents the classifiers avoidable loss of accuracy due to statistical corruptions. On 2D and image data, we show that the metric reflects different levels of classifier robustness. Furthermore, we observe unexpected optima in classifiers robust accuracy through training and testing classifiers with different levels of noise. While researchers have frequently reported on a significant tradeoff on accuracy when training robust models, we strengthen the view that a tradeoff between accuracy and corrupt