To evaluate the performance of audio signal processing algorithms and to train data-driven algorithms, e.g., as applied in hearing instruments, either simulated or recorded data can be used. While large batches of simulated data can be generated using mathematical models, recorded data provide a more adequate representation of real-life scenarios. Therefore, in this paper, the Hearing Instrument Dataset in Various Acoustical Scenarios (HIDVAS) is introduced. This dataset consists of both impulse responses and audio recordings obtained using eight external loudspeakers, two external microphones, and a dummy head. Behind-the-ear (BTE) hearing instrument shells with two microphones per shell are mounted on this dummy head, and receiver-in-canal (RIC) hearing instrument loudspeakers are inserted in its ears. The dummy head also contains microphones located at its eardrums. The impulse responses have been computed from a swept-sine recording for each microphone-loudspeaker pair, and the audio recordings have been obtained by playing back audio (male and female speech, speech-shaped noise, singing voice, stringed instrument, wind instrument, and percussion instrument) through each ind
When listening to a sound source in everyday-life situations, typical movement behavior can lead to a mismatch between the direction of the head and the direction of interest. This could reduce the performance of directional algorithms, as was shown in previous work for head movements of normal-hearing listeners. However, the movement behavior of hearing-impaired listeners and hearing aid users might be different, and if hearing aid users adapt their self-motion because of the directional algorithm, its performance might increase. In this work we therefore investigated the influence of hearing impairment on self-motion, and the interaction of hearing aids with self-motion. To this end, the self-motion of three hearing-impaired (HI) participant groups (aided with an adaptive differential microphone (ADM), aided without ADM, and unaided) was compared across groups and with previously measured self-motion data from younger and older normal-hearing (NH) participants. The self-motion was measured in virtual audiovisual environments (VEs) in the laboratory. Furthermore, the signal-to-noise ratios (SNRs) and SNR improvement of the ADM resulting from the head movements of the participants were e
Video and audio content creation serves as the core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of the technique from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner built on the pre-trained ImageBind model. Our latent aligner shares a similar core with classifier guidance, which guides the diffusion denoising process at inference time. Through a carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/
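The guidance idea described above, nudging a latent during inference so that its embedding moves toward a target in a shared space, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the fixed random linear map `ENCODER` merely stands in for a pre-trained embedding model such as ImageBind, the squared-distance loss and the step-size rule are arbitrary choices, and in a real sampler such updates would be interleaved with the denoising steps. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a fixed linear "encoder" and a target embedding.
ENCODER = rng.standard_normal((8, 16))   # maps a 16-dim latent to an 8-dim embedding
target_embedding = rng.standard_normal(8)

def alignment_loss(latent):
    """Squared distance between the embedded latent and the target embedding."""
    diff = ENCODER @ latent - target_embedding
    return float(diff @ diff)

def alignment_grad(latent):
    """Analytic gradient of alignment_loss with respect to the latent."""
    return 2.0 * ENCODER.T @ (ENCODER @ latent - target_embedding)

def guide(latent, steps=20):
    """Apply a few gradient steps that pull the latent toward the target."""
    # Step size derived from the encoder's spectral norm so each step lowers the loss.
    step_size = 0.5 / np.linalg.norm(ENCODER, 2) ** 2
    for _ in range(steps):
        latent = latent - step_size * alignment_grad(latent)
    return latent

latent = rng.standard_normal(16)
loss_before = alignment_loss(latent)
guided = guide(latent)
loss_after = alignment_loss(guided)
```

The point of the sketch is only that the generative model's weights are never touched: the latent itself is optimized against a loss computed in the shared embedding space.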
To improve the sound quality of hearing devices, equalization filters can be used that aim at achieving acoustic transparency, i.e., listening with the device in the ear is perceptually similar to the open ear. The equalization filter needs to ensure that the superposition of the equalized signal played by the device and the signal leaking through the device into the ear canal matches a processed version of the signal reaching the eardrum of the open ear. Depending on the processing delay of the hearing device, comb-filtering artifacts can occur due to this superposition, which may degrade the perceived sound quality. In this paper we propose a unified least-squares-based procedure to design single- and multi-loudspeaker equalization filters for hearing devices aiming at achieving acoustic transparency. To account for non-minimum phase components, we introduce so-called acausality management. To reduce comb-filtering artifacts, we propose to use a frequency-dependent regularization. Experimental results using measured acoustic transfer functions from a multi-loudspeaker earpiece show that the proposed equalization filter design procedure makes it possible to achieve robust acoustic transpa
Background. Hearing aid technology has proven successful in the rehabilitation of hearing loss, but its performance is still limited in difficult everyday conditions characterized by noise and reverberation. Objectives. Introduction to the current state of hearing aid technology and presentation of the current state of research and future development. Methods. Current literature is analyzed and several specific new developments are presented. Results. Both objective and subjective data from empirical studies show the limitations of current technology. Examples of current research show the potential of machine-learning-based algorithms and multi-modal signal processing for improving speech processing and perception, of virtual reality for improving hearing device fitting, and of mobile health technology for improving hearing-health services. Conclusions. Hearing device technology will remain a key factor in the rehabilitation of hearing impairment. New technologies such as machine learning, multi-modal signal processing, virtual reality, and mobile health technology will improve speech enhancement, individual fitting, and communication training.
Current assistive hearing devices, such as hearing aids and cochlear implants, lack the ability to adapt to the listener's focus of auditory attention, limiting their effectiveness in complex acoustic environments like cocktail party scenarios where multiple conversations occur simultaneously. Neuro-steered hearing devices aim to overcome this limitation by decoding the listener's auditory attention from neural signals, such as electroencephalography (EEG). While many auditory attention decoding (AAD) studies have used high-density scalp EEG, such systems are impractical for daily use as they are bulky and uncomfortable. Therefore, AAD with wearable and unobtrusive EEG systems that are comfortable to wear and can be used for long-term recording is required. Around-ear EEG systems like cEEGrids have shown promise in AAD, but in-ear EEG, recorded via custom earpieces offering superior comfort, remains underexplored. We present a new AAD dataset with simultaneously recorded scalp, around-ear, and in-ear EEG, enabling a direct comparison. Using a classic linear stimulus reconstruction algorithm, we find a significant performance gap between all three systems, with AAD accuracies of 83.
We present the first wireless earbud hardware that can perform hearing screening by detecting otoacoustic emissions. The conventional wisdom has been that detecting otoacoustic emissions, which are the faint sounds generated by the cochlea, requires sensitive and expensive acoustic hardware. Thus, medical devices for hearing screening cost thousands of dollars and are inaccessible in low- and middle-income countries. We show that by designing wireless earbuds using low-cost acoustic hardware and combining them with wireless sensing algorithms, we can reliably identify otoacoustic emissions and perform hearing screening. Our algorithms combine frequency-modulated chirps with wideband pulses emitted from a low-cost speaker to reliably separate otoacoustic emissions from in-ear reflections and echoes. We conducted a clinical study with 50 ears across two healthcare sites. Our study shows that the low-cost earbuds detect hearing loss with 100% sensitivity and 89.7% specificity, which is comparable to the performance of an $8000 medical device. By developing low-cost and open-source wearable technology, our work may help address global health inequities in hearing screening by democratizi
Early and accurate detection systems for ear diseases, powered by deep learning, are essential for preventing hearing impairment and improving population health. However, the limited diversity of existing otoendoscopy datasets and the poor balance between diagnostic accuracy, computational efficiency, and model size have hindered the translation of artificial intelligence (AI) algorithms into healthcare applications. In this study, we constructed a large-scale, multi-center otoendoscopy dataset covering eight common ear diseases and healthy cases. Building upon this resource, we developed Best-EarNet, an ultrafast and lightweight deep learning architecture integrating a novel Local-Global Spatial Feature Fusion Module with a multi-scale supervision strategy, enabling real-time and accurate classification of ear conditions. Leveraging transfer learning, Best-EarNet, with a model size of only 2.94 MB, achieved diagnostic accuracies of 95.23% on an internal test set (22,581 images) and 92.14% on an external test set (1,652 images), while requiring only 0.0125 seconds (80 frames per second) to process a single image on a standard CPU. Further subgroup analysis by gender and age showed
The human ear canal couples the external sound field to the eardrum and the solid parts of the middle ear. Therefore, knowledge of the acoustic impedance of the human ear is widely used in industry to develop audio devices such as smartphones, headsets, and hearing aids. In this study, acoustic impedance measurements in the human ear canal of 32 adult subjects are presented. Wideband measurement techniques developed specifically for this purpose enable impedance measurements to be obtained in the full audio band up to 20 kHz. Full ear canal geometries of all subjects are also available from the first-of-its-kind in vivo magnetic resonance imaging study of the human outer ear. These ear canal geometries are used to obtain individual ear moulds of all subjects and to process the data. By utilizing a theoretical Webster's horn description, the measured impedance is propagated in each ear canal to a common theoretical reference plane across all subjects. At this plane, the mean human impedance and the standard deviation of the population are found. The results are further divided demographically by gender and age and compared to a widely used ear simulator (the IEC711 coupler).
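How an impedance measured at one plane can be moved to another plane along a canal of varying cross-section may be sketched as follows. This is a minimal lossless illustration, not the procedure used in the study: the canal is approximated by short cylindrical segments (a common discretization of Webster's horn equation), and the area profile, segment length, air constants, and function names are all hypothetical.

```python
import numpy as np

RHO = 1.2    # air density in kg/m^3 (assumed value)
C = 343.0    # speed of sound in m/s (assumed value)

def propagate_inward(z_meas, areas, seg_len, freq):
    """Move an acoustic impedance from the measurement plane deeper into the canal.

    areas: cross-sectional areas (m^2) of consecutive cylindrical segments,
           ordered from the measurement plane inward.
    """
    k = 2 * np.pi * freq / C
    c, s = np.cos(k * seg_len), np.sin(k * seg_len)
    z = complex(z_meas)
    for area in areas:
        z0 = RHO * C / area  # characteristic impedance of this segment
        z = z0 * (z * c - 1j * z0 * s) / (z0 * c - 1j * z * s)
    return z

def propagate_outward(z_inner, areas, seg_len, freq):
    """Inverse operation: move an impedance from the inner plane back outward."""
    k = 2 * np.pi * freq / C
    c, s = np.cos(k * seg_len), np.sin(k * seg_len)
    z = complex(z_inner)
    for area in reversed(areas):
        z0 = RHO * C / area
        z = z0 * (z * c + 1j * z0 * s) / (z0 * c + 1j * z * s)
    return z

# Hypothetical canal: four 2 mm segments narrowing toward the eardrum.
areas = [4.0e-5, 3.8e-5, 3.5e-5, 3.0e-5]
z_meas = 2.0e7 + 1.0e7j  # made-up measured impedance in Pa*s/m^3
z_ref = propagate_inward(z_meas, areas, 2e-3, 1000.0)
z_back = propagate_outward(z_ref, areas, 2e-3, 1000.0)
```

Propagating inward and then outward over the same segments returns the original value, which is a quick sanity check on the two transformations. Losses at the canal walls, which matter at high frequencies, are ignored here.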
Ear occlusions (arising from the presence of ear accessories such as earrings and earphones) can negatively impact performance in ear-based biometric recognition systems, especially in unconstrained imaging circumstances. In this study, we assess the effectiveness of a diffusion-based ear inpainting technique as a pre-processing aid to mitigate the issues of ear accessory occlusions in transformer-based ear recognition systems. Given an input ear image and an automatically derived accessory mask, the inpainting model reconstructs clean and anatomically plausible ear regions by synthesizing missing pixels while preserving local geometric coherence along key ear structures, including the helix, antihelix, concha, and lobule. We evaluate the effectiveness of this pre-processing aid in transformer-based recognition systems for several vision transformer models and different patch sizes for a range of benchmark datasets. Experiments show that diffusion-based inpainting can be a useful pre-processing aid to alleviate ear accessory occlusions to improve overall recognition performance.
Ear recognition is a contactless and unobtrusive biometric technique with applications across various domains. However, deploying high-performing ear recognition models on resource-constrained devices is challenging, limiting their applicability and widespread adoption. This paper introduces EdgeEar, a lightweight model based on a proposed hybrid CNN-transformer architecture to solve this problem. By incorporating low-rank approximations into specific linear layers, EdgeEar reduces its parameter count by a factor of 50 compared to the current state-of-the-art, bringing it below two million while maintaining competitive accuracy. Evaluation on the Unconstrained Ear Recognition Challenge (UERC2023) benchmark shows that EdgeEar achieves the lowest EER while significantly reducing computational costs. These findings demonstrate the feasibility of efficient and accurate ear recognition, which we believe will contribute to the wider adoption of ear biometrics.
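The parameter saving obtained by replacing a dense linear layer with a low-rank factorization can be made concrete with a short sketch. The layer size (512 x 512) and rank (16) are hypothetical and not taken from EdgeEar, and the truncated SVD is used here only as one common way to obtain a low-rank approximation; the paper's actual factorization and training procedure may differ.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of weight."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    b = vt[:rank, :]             # shape (rank, in_features)
    return a, b

def param_counts(out_features, in_features, rank):
    """Parameter counts of the dense layer and of its two low-rank factors."""
    dense = out_features * in_features
    factored = rank * (out_features + in_features)
    return dense, factored

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))   # hypothetical dense weight matrix
A, B = low_rank_factors(W, rank=16)
dense, factored = param_counts(512, 512, 16)
reduction = dense / factored          # 262144 / 16384 = 16.0
```

For this single hypothetical layer the factorization stores 16 times fewer weights. The factor of 50 reported above refers to the whole model relative to the state of the art, so the two numbers are not directly comparable.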
The integration of artificial intelligence into hearing assistance marks a paradigm shift from traditional amplification-based systems to intelligent, context-aware audio processing. This systematic literature review evaluates advances in AI-driven selective noise cancellation (SNC) for hearing aids, highlighting technological evolution, implementation challenges, and future research directions. We synthesize findings across deep learning architectures, hardware deployment strategies, clinical validation studies, and user-centric design. The review traces progress from early machine learning models to state-of-the-art deep networks, including Convolutional Recurrent Networks for real-time inference and Transformer-based architectures for high-accuracy separation. Key findings include significant gains over traditional methods, with recent models achieving up to 18.3 dB SI-SDR improvement on noisy-reverberant benchmarks, alongside sub-10 ms real-time implementations and promising clinical outcomes. Yet, challenges remain in bridging lab-grade models with real-world deployment, particularly around power constraints, environmental variability, and personalization. Identified research
Ear recognition can be described as a revived scientific field. Ear biometrics were long believed not to be accurate enough and held a secondary place in scientific research, being seen as only complementary to other types of biometrics, due to difficulties in correctly measuring ear characteristics and the potential occlusion of the ear by hair, clothes and ear jewellery. However, recent research has reinstated them as a vibrant research field, having addressed these problems and shown that ear biometrics can provide highly accurate identification and verification results. Several 2D and 3D imaging techniques, as well as acoustical techniques using sound emission and reflection, have been developed and studied for ear recognition, while there have also been significant advances towards fully automated recognition of the ear. Furthermore, ear biometrics have been proven to be mostly non-invasive, adequately permanent and accurate, and hard to spoof and counterfeit. Moreover, different ear recognition techniques have proven to be as effective as face recognition ones, thus providing the opportunity for ear recognition to be used in identification and verification applicat
Developing and selecting hearing aids is a time-consuming process, which is simplified by using objective models. Previously, the framework for auditory discrimination experiments (FADE) accurately simulated benefits of hearing aid algorithms with root mean squared prediction errors below 3 dB. One FADE simulation requires several hours of (un)processed signals, which is obstructive when the signals have to be recorded. We propose and evaluate a data-reduced FADE version (DARF) which facilitates simulations with signals that cannot be processed digitally, but that can only be recorded in real-time. DARF simulates one speech recognition threshold (SRT) with about 30 minutes of recorded and processed signals of the (German) matrix sentence test. Benchmark experiments comparing DARF and standard FADE showed small differences for stationary maskers (1 dB), but larger differences with strongly fluctuating maskers (5 dB). Hearing impairment and hearing aid algorithms seemed to reduce the differences. Hearing aid benefits were simulated in terms of speech recognition with three pairs of real hearing aids in silence ($\geq$8 dB), in stationary and fluctuating maskers i
We present a database of acoustic transfer functions of the Hearpiece, an openly available multi-microphone multi-driver in-the-ear earpiece for hearing device research. The database includes HRTFs for 87 incidence directions as well as responses of the drivers, all measured at the four microphones of the Hearpiece as well as the eardrum in the occluded and open ear. The transfer functions were measured in both ears of 25 human subjects and a KEMAR with anthropometric pinnae for five reinsertions of the device. We describe the measurements of the database and analyse derived acoustic parameters of the device. All regarded transfer functions are subject to differences between subjects as well as variations due to reinsertion into the same ear. Also, the results show that KEMAR measurements represent a median human ear well for all assessed transfer functions. The database is a rich basis for development, evaluation and robustness analysis of multiple hearing device algorithms and applications. The database is openly available at https://doi.org/10.5281/zenodo.3733191.
This study explores the significance of robot hearing systems, emphasizing their importance for robots operating in diverse and uncertain environments. It introduces the hardware design principles using robotaxis as an example, where exterior microphone arrays are employed to detect sound events such as sirens. The challenges, goals, and test methods are discussed, focusing on achieving a suitable signal-to-noise ratio (SNR). Additionally, it presents a preliminary software framework rooted in probabilistic robotics theory, advocating for the integration of robot hearing into the broader context of perception and decision-making. It discusses various models, including Bayes filters, partially observable Markov decision processes (POMDP), and multiagent systems, highlighting the multifaceted roles that robot hearing can play. In conclusion, as service robots continue to evolve, robot hearing research will expand, offering new perspectives and challenges for future development beyond simple sound event classification.
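A Bayes filter of the kind mentioned above can be sketched for the siren example with a two-state model (no siren, siren) and a binary detector output. All probabilities below are hypothetical values chosen for illustration; nothing here is taken from the study's framework.

```python
import numpy as np

# Hypothetical model parameters (state 0 = no siren, state 1 = siren).
# TRANSITION[i, j] = P(next state j | current state i)
TRANSITION = np.array([[0.95, 0.05],
                       [0.10, 0.90]])
# LIKELIHOOD[z, state] = P(detector output z | state), z in {0, 1}
LIKELIHOOD = np.array([[0.8, 0.3],
                       [0.2, 0.7]])

def bayes_filter_step(belief, z):
    """One predict-and-update cycle of a discrete Bayes filter."""
    predicted = TRANSITION.T @ belief        # prediction with the motion model
    posterior = LIKELIHOOD[z] * predicted    # correction with the measurement
    return posterior / posterior.sum()       # normalization

def run_filter(observations, prior=(0.5, 0.5)):
    """Return the belief after each observation as an array of shape (T, 2)."""
    belief = np.asarray(prior, dtype=float)
    history = []
    for z in observations:
        belief = bayes_filter_step(belief, z)
        history.append(belief.copy())
    return np.array(history)

# Two frames without a detection followed by four frames with one.
beliefs = run_filter([0, 0, 1, 1, 1, 1])
siren_probability = beliefs[:, 1]
```

With these assumed numbers, the belief in a siren drops during the first two frames and rises once detections accumulate, so a single noisy detection does not immediately flip the decision. That temporal smoothing is what distinguishes the filter from frame-by-frame sound event classification.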
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real-time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100--1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with
Ear recognition has gained attention as a reliable biometric technique due to the distinctive characteristics of human ears. With the increasing availability of large-scale datasets, convolutional neural networks (CNNs) have been widely adopted to learn features directly from raw ear images, outperforming traditional hand-crafted methods. However, the effect of bilateral ear symmetry on the features learned by CNNs has received little attention in recent studies. In this paper, we investigate how bilateral ear symmetry influences the effectiveness of CNN-based ear recognition. To this end, we first develop an ear side classifier to automatically categorize ear images as either left or right. We then explore the impact of incorporating this side information during both training and testing. Cross-dataset evaluations are conducted on five datasets. Our results suggest that treating left and right ears separately during training and testing can lead to notable performance improvements. Furthermore, our ablation studies on alignment strategies, input sizes, and various hyperparameter settings provide practical insights into training CNN-based ear recognition systems on large-scale dataset
In this paper, we present a detailed analysis of extracting soft biometric traits, age and gender, from ear images. Although there have been a few previous works on gender classification using ear images, to the best of our knowledge, this study is the first work on age classification from ear images. In the study, we have utilized both geometric features and appearance-based features for ear representation. The utilized geometric features are based on eight anthropometric landmarks and consist of 14 distance measurements and two area calculations. The appearance-based methods employ deep convolutional neural networks for representation and classification. The well-known convolutional neural network models AlexNet, VGG-16, GoogLeNet, and SqueezeNet have been adopted for the study. They have been fine-tuned on a large-scale ear dataset that has been built from the profile and close-to-profile face images in the Multi-PIE face dataset. In this way, we have performed a domain adaptation. The updated models have been fine-tuned one more time on the small-scale target ear dataset, which contains only around 270 ear images for training. According to the experimental results, appear
Biometric-based authentication is gaining increasing attention for wearables and mobile applications. Meanwhile, the growing adoption of sensors in wearables also provides opportunities to capture novel wearable biometrics. In this work, we propose EarDynamic, an ear canal deformation-based user authentication system using in-ear wearables. EarDynamic provides continuous and passive user authentication and is transparent to users. For authentication, it leverages ear canal deformation, which combines the unique static geometry and the dynamic motions of the ear canal when the user is speaking. It utilizes an acoustic sensing approach to capture the ear canal deformation with the built-in microphone and speaker of the in-ear wearable. Specifically, it first emits well-designed inaudible beep signals and records the reflected signals from the ear canal. It then analyzes the reflected signals and extracts fine-grained acoustic features that correspond to the ear canal deformation for user authentication. Our extensive experimental evaluation shows that EarDynamic can achieve a recall of 97.38% and an F1 score of 96.84%. Results also show that our system works well under different noisy environments