Recent Mixture-of-Experts (MoE)-based large language models (LLMs) such as Qwen-MoE and DeepSeek-MoE are transforming generative AI in natural language processing. However, these models require vast and diverse training data. Federated learning (FL) addresses this challenge by leveraging private data from heterogeneous edge devices for privacy-preserving MoE training. Nonetheless, traditional FL approaches require devices to host local MoE models, which is impractical for resource-constrained devices due to large model sizes. To address this, we propose DeepFusion, the first scalable federated MoE training framework that enables the fusion of heterogeneous on-device LLM knowledge via federated knowledge distillation, yielding a knowledge-abundant global MoE model. Specifically, DeepFusion lets each device independently configure and train an on-device LLM tailored to its own needs and hardware limitations. Furthermore, we propose a novel View-Aligned Attention (VAA) module that integrates multi-stage feature representations from the global MoE model to construct a predictive perspective aligned with on-device LLMs, thereby enabling effective cross-architecture knowledge dist
This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge, along with its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022-2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics, reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy with a device-agnostic model, improving to 51.89% when incorporating device-specific fine-tuning. The task attracted 31 submissions from 12 teams, with 11 teams outperforming the baseline. The top-performing submission achieved an accuracy gain of more than 8 percentage points over the baseline on the evaluation set.
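As a rough illustration of what device information at inference time makes possible, the sketch below routes each input to a device-specific model when one exists and otherwise falls back to a device-agnostic model. This is not the challenge baseline: the function name `make_device_aware_classifier`, the device identifier `"device_a"`, the toy threshold models, and the fallback rule for unknown devices are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence

# A "model" here is any callable mapping a feature vector to a scene label.
Model = Callable[[Sequence[float]], str]


def make_threshold_model(threshold: float) -> Model:
    """Toy stand-in for a trained network: thresholds the feature sum."""
    def model(features: Sequence[float]) -> str:
        return "street" if sum(features) > threshold else "park"
    return model


def make_device_aware_classifier(
    general_model: Model, device_models: Dict[str, Model]
) -> Callable[[Sequence[float], str], str]:
    """Return a classifier that uses the device ID supplied at inference time."""
    def classify(features: Sequence[float], device_id: str) -> str:
        # Assumption: devices without a dedicated model use the general one.
        model = device_models.get(device_id, general_model)
        return model(features)
    return classify


# Hypothetical setup: one general model and one variant tuned for "device_a".
general = make_threshold_model(0.0)
tuned_for_a = make_threshold_model(0.5)
classify = make_device_aware_classifier(general, {"device_a": tuned_for_a})

label_known = classify([0.2, 0.1], "device_a")      # handled by the tuned model
label_unknown = classify([0.2, 0.1], "new_device")  # handled by the general model
```

Keeping the dispatch separate from the models means a device-specific variant can be added or removed without touching the general model.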
This work presents a detailed analysis of the photothermionic-photovoltaic hybrid solar device. The electrons in this hybrid device gain energy from both the solar photons and thermophotons generated within the device, and hence the device has the potential to offer a voltage boost compared to individual photothermionic or photovoltaic devices. We show that the gap size between the photothermionic emitter and the photovoltaic collector crucially affects the device performance due to the strong dependence of the electronic and photonic coupling strengths on this gap size. We also investigate how the current matching constraint between the thermionic and photovoltaic stages can affect the hybrid solar device performance by studying different hybrid device configurations. Moreover, the hybrid devices are compared with the single photothermionic solar device with a metallic collector. Interestingly, we observe that the addition of a photovoltaic stage meant to enable the hybrid device to capture the entire terrestrial solar spectrum does not necessarily lead to higher overall conversion efficiency.
Small Language Models (SLMs, or on-device LMs) have significantly fewer parameters than Large Language Models (LLMs). They are typically deployed on low-end devices, like mobile phones and single-board computers. Unlike LLMs, which rely on increasing model size for better generalisation, SLMs designed for edge applications are expected to adapt to their deployment environments and to be energy efficient given device battery-life constraints, requirements that are not addressed in datacenter-deployed LLMs. This paper addresses these two requirements by proposing a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). Each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We comprehensively evaluate the extracted low-rank structures across compression ratio, language task performance, latency, and energy consumption on a typical low-end device, namely a Raspberry Pi. Taking the sub-billion parameter versions of GPT-2/Cerebras-GPT and OPT models as examples, our approach achieves language task performance comparable to the original model with around $2.0\times$ embedding layer compression, while the energ
This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing $15-40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64x64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary > baked-in header > He-initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontro
The Internet-of-Things (IoT) has brought new challenges in device identification (what the device is) and authentication (whether the device is the one it claims to be). Traditionally, the authentication problem is solved by means of a cryptographic protocol. However, the computational complexity of cryptographic protocols and/or scalability problems related to key management render almost all cryptography-based authentication protocols impractical for IoT. The problem of device identification is, on the other hand, sadly neglected. We believe that device fingerprinting can be used to solve both these problems effectively. In this work, we present a methodology to perform device behavioral fingerprinting that can be employed to undertake device type identification. A device's behavior is approximated using features extracted from the network traffic of the device. These features are used to train a machine learning model that can be used to detect similar device types. We validate our approach using five-fold cross validation; we report an identification rate of 86-99% and a mean accuracy of 99% across all our experiments. Our approach is successful even when a device uses encrypted c
Large language models are increasingly used as orchestrators of external tools via the Model Context Protocol (MCP), but MCP is built for software services with megabytes of memory and does not descend to the microcontrollers that dominate the long tail of physical devices. Recent work (IoT-MCP) ports MCP to edge gateways at 74 KB peak memory; this still excludes the smallest commodity MCUs and, critically, does not address the safety problem of giving an unreliable caller (an LLM that may hallucinate or be prompt-injected) direct control of physical hardware. We present the Device Context Protocol (DCP): a sub-50-byte typical frame (6-byte header + CBOR payload + optional 16-byte HMAC), a manifest schema in which capability scoping, range and type checks, dry-run evaluation, and units-as-types are protocol-layer primitives, and a host-side Bridge that rejects malformed or hallucinated calls before any byte reaches the device. Reference firmware measures 27.6 KB flash / 0.6 KB RAM on ESP32; the Python Bridge, ESP32 firmware, and a language-neutral conformance suite are MIT-licensed and public. An empirical study -- 675 tool calls produced by five LLMs across four vendors (DeepSeek,
Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift--an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads which contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
We present NeuroSPICE, a physics-informed neural network (PINN) framework for device and circuit simulation. Unlike conventional SPICE, which relies on time-discretized numerical solvers, NeuroSPICE leverages PINNs to solve circuit differential-algebraic equations (DAEs) by minimizing the residual of the equations through backpropagation. It models device and circuit waveforms using analytical equations in time domain with exact temporal derivatives. While PINNs do not outperform SPICE in speed or accuracy during training, they offer unique advantages such as surrogate models for design optimization and inverse problems. NeuroSPICE's flexibility enables the simulation of emerging devices, including highly nonlinear systems such as ferroelectric memories.
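To make the idea of residual minimization concrete, here is a minimal sketch on a series RC circuit. It is not NeuroSPICE: a one-parameter analytic trial waveform stands in for the neural network, a hand-derived gradient stands in for backpropagation, and the normalized values `R = C = V_SRC = 1` are assumptions chosen for the example. The trial waveform is differentiated exactly in time, and its parameter is tuned so that the circuit equation `R*C*dv/dt + v = V_SRC` holds at a set of collocation points.

```python
import math

# Normalized, hypothetical component values (assumptions for this sketch).
R, C, V_SRC = 1.0, 1.0, 1.0


def waveform(t: float, a: float) -> float:
    """Trial capacitor voltage v(t) = V_SRC * (1 - exp(-a*t)); v(0) = 0 by construction."""
    return V_SRC * (1.0 - math.exp(-a * t))


def waveform_dt(t: float, a: float) -> float:
    """Exact time derivative of the trial waveform (no time discretization)."""
    return V_SRC * a * math.exp(-a * t)


def residual(t: float, a: float) -> float:
    """Residual of the circuit equation R*C*dv/dt + v - V_SRC = 0."""
    return R * C * waveform_dt(t, a) + waveform(t, a) - V_SRC


def residual_da(t: float, a: float) -> float:
    """Derivative of the residual with respect to the trainable parameter a."""
    return V_SRC * math.exp(-a * t) * (R * C - t * (R * C * a - 1.0))


def fit(a: float = 0.3, lr: float = 0.5, steps: int = 2000,
        n_points: int = 31, t_end: float = 3.0) -> float:
    """Gradient descent on the mean squared residual over collocation points."""
    ts = [t_end * i / (n_points - 1) for i in range(n_points)]
    for _ in range(steps):
        grad = sum(2.0 * residual(t, a) * residual_da(t, a) for t in ts) / len(ts)
        a -= lr * grad
    return a


a_fit = fit()  # should approach 1 / (R * C), the inverse time constant
```

No reference waveform is used during the fit; the only training signal is the equation residual, which is what separates this style of solver from fitting to simulated data.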
We present the motivation and early tests for a novel solar instrument that will harness the new High Efficiency Pixel (HEP) Texas Instruments DLP801RE Digital Micromirror Device (DMD) as a reconfigurable spatial light modulator. This design enables real-time, dynamic configuration of the field of view for targeted spectroscopy of magnetically active regions and full-disk observations. Optical efficiency was validated through simulations and laser testing. Destructive window removal allowed for detailed structural analysis, confirming the elimination of central vias present in previous models. We measured a contrast ratio of 250:1, currently limited by the evaluation board's duty cycle rather than the DMD itself. Furthermore, we successfully simulated artificial planetary transits, recovering depths ranging from gas giants to a 40 ppm rocky planet transit. These results demonstrate the HEP DMD's potential for high-precision solar and exoplanetary science applications.
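For a sense of scale, the depth of a transit is commonly approximated as the squared ratio of planet radius to star radius. The sketch below applies that geometric approximation, assuming a Sun-sized host star and ignoring limb darkening; it says nothing about how the authors generated their artificial transits. The function names are hypothetical, and the radii are standard reference values.

```python
import math

R_SUN_KM = 695_700.0     # nominal solar radius
R_JUPITER_KM = 69_911.0  # mean radius of Jupiter
R_EARTH_KM = 6_371.0     # mean radius of Earth


def transit_depth_ppm(planet_radius_km: float,
                      star_radius_km: float = R_SUN_KM) -> float:
    """Geometric transit depth (Rp / Rs)^2, in parts per million."""
    return (planet_radius_km / star_radius_km) ** 2 * 1e6


def radius_for_depth_km(depth_ppm: float,
                        star_radius_km: float = R_SUN_KM) -> float:
    """Planet radius that would produce the given geometric depth."""
    return math.sqrt(depth_ppm * 1e-6) * star_radius_km


jupiter_depth = transit_depth_ppm(R_JUPITER_KM)              # about 1% of the flux
earth_depth = transit_depth_ppm(R_EARTH_KM)                  # under 100 ppm
radius_40ppm_earths = radius_for_depth_km(40.0) / R_EARTH_KM  # sub-Earth size
```

Under these assumptions, a 40 ppm signal is shallower than an Earth-sized transit of a Sun-sized star, which gives a sense of how demanding the smallest recovered depth is.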
Deep learning techniques have achieved some success in recording device source identification. The recording device source features include spatial information and certain temporal information. However, most recording device source identification methods based on deep learning only use spatial representation learning from recording device source features, which cannot make full use of recording device source information. Therefore, in this paper, to fully explore the spatial information and temporal information of the recording device source, we propose a new method for recording device source identification based on the fusion of spatial feature information and temporal feature information by using an end-to-end framework. From a feature perspective, we design two kinds of networks to extract recording device source spatial and temporal information. Afterward, we use the attention mechanism to adaptively assign weights to the spatial and temporal information to obtain fusion features. From a model perspective, our model uses an end-to-end framework to learn the deep representation from spatial feature and temporal feature and train using deep and shallow loss to joint
Device-cloud collaboration holds promise for deploying large language models (LLMs), leveraging lightweight on-device models for efficiency while relying on powerful cloud models for superior reasoning. A central challenge in this setting is determining, for each incoming query, whether it should be processed locally or offloaded to the cloud. Existing approaches typically rely on external routers, which often struggle to determine difficulty from the prompt itself, especially for tasks involving complex reasoning. Motivated by this limitation, we propose enabling on-device LLMs to decide internally whether to invoke cloud assistance at inference time, with this capability instilled through reinforcement learning based post-training. Casting on-device LLM post-training as a reward maximization problem, we design hierarchical rewards to encourage local problem solving and judicious cloud offloading. To solve the resulting problem, we develop an algorithm featuring a group-level policy gradient that stabilizes optimization, together with adaptive prompt filtering that provides complementary learning signals to mitigate policy collapse (i.e., exclusive local execution or exclusive clo
Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcription, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency at 0.46s while achieving the highest accuracy, at 2.2% WER. The optimizations behind the WhisperKit system are described in detail in this paper.
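Word error rate (WER), the accuracy metric quoted above, is conventionally the word-level edit distance between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below implements that textbook definition; the function name `word_error_rate` is illustrative, and the text normalization applied in the paper's benchmark (casing, punctuation, number formatting) is not specified here and is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer_example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.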
Device-independent quantum cryptographic schemes aim to guarantee security to users based only on the output statistics of any components used, and without the need to verify their internal functionality. Since this would protect users against untrustworthy or incompetent manufacturers, sabotage or device degradation, this idea has excited much interest, and many device-independent schemes have been proposed. Here we identify a critical weakness of device-independent protocols that rely on public communication between secure laboratories. Untrusted devices may record their inputs and outputs and reveal information about them via publicly discussed outputs during later runs. Reusing devices thus compromises the security of a protocol and risks leaking secret data. Possible defences include securely destroying or isolating used devices. However, these are costly and often impractical. We propose other more practical partial defences as well as a new protocol structure for device-independent quantum key distribution that aims to achieve composable security in the case of two parties using a small number of devices to repeatedly share keys with each other (and no other party).
Symbiotic radio (SR) is a promising technique to support cellular Internet-of-Things (IoT) by forming a mutualistic relationship between IoT and cellular transmissions. In this paper, we propose a novel multi-user multi-IoT-device SR system to enable massive access in cellular IoT. In the considered system, the base station (BS) transmits information to multiple cellular users, and a number of IoT devices simultaneously backscatter their information to these users via the cellular signal. The cellular users jointly decode the information from the BS and IoT devices. Noting that the reflective links from the IoT devices can be regarded as the channel uncertainty of the direct links, we apply the robust design method to design the beamforming vectors at the BS. Specifically, the transmit power is minimized under the cellular transmission outage probability constraints and IoT transmission sum rate constraints. The algorithm based on semi-definite programming and difference-of-convex programming is proposed to solve the power minimization problem. Moreover, we consider a special case where each cellular user is associated with several adjacent IoT devices and propose a direction of ar
The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. This leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.
In this work we propose a novel device for controlling the flow of information using Weyl fermions. Based on a previous work of our group, we show that it is possible to fully control the flow of Weyl fermions on several different channels, by applying an electric field perpendicular to the direction of motion of the particles on each channel. In this way, we can transmit information as logical bits, depending on the existence or not of a Weyl current on each channel. We also show that the response time of this device is exceptionally low, less than 1 ps, for typical values of its parameters, allowing the control of the flow of information at extremely high rates, of the order of 100 Petabits per second. Alternatively, this device could also operate as an electric field sensor. In addition, we demonstrate that Weyl fermions can be efficiently guided through the proposed device using appropriate magnetic fields. Finally, we discuss some particularly interesting remarks regarding the electromagnetic interactions of high energy particles.
We are witnessing an increasing availability of streaming data that may contain valuable information on the underlying processes. It is thus attractive to be able to deploy machine learning models on edge devices near sensors such that decisions can be made instantaneously, rather than first having to transmit incoming data to servers. To enable deployment on edge devices with limited storage and computational capabilities, the full-precision parameters in standard models can be quantized to use fewer bits. The resulting quantized models are then calibrated using back-propagation and full training data to ensure accuracy. This one-time calibration works for deployments in static environments. However, model deployment in dynamic edge environments calls for continual calibration to adaptively adjust quantized models to fit new incoming data, which may have different distributions. The first difficulty in enabling continual calibration on the edge is that the full training data may be too large and thus not always available on edge devices. The second difficulty is that the use of back-propagation on the edge for repeated calibration is too expensive. We propose QCore to enable contin
To generate genuine random numbers, random number generators based on quantum theory are essential. However, ensuring that the process used to produce randomness meets desired security standards can pose challenges for traditional quantum random number generators. This thesis delves into Device Independent (DI) and Semi-Device Independent (semi-DI) protocols of randomness expansion, based on a minimal set of experimentally verifiable security assumptions. The security in DI protocols relies on the violation of Bell inequalities, which certify the quantum behavior of devices. The semi-DI protocols discussed in this thesis require the characterization of only one device - a power meter. These protocols exploit the fact that quantum states can be prepared such that they cannot be distinguished with certainty, thereby creating a randomness resource. In this study, we introduce enhanced DI and semi-DI protocols that surpass existing ones in terms of output randomness rate, security, or in some instances, both. Our analysis employs the Entropy Accumulation Theorem (EAT) to determine the extractable randomness for finite rounds. A notable contribution is the introduction of randomness exp
Non-Gaussian operations, in particular photon subtraction (PS), have been shown to enhance the performance of various quantum information processing tasks, including continuous variable measurement device independent quantum key distribution (CV-MDI-QKD). This work investigates the role of non-Gaussian resource states, namely the photon subtracted two-mode squeezed coherent (PSTMSC) states (which include photon subtracted two-mode squeezed vacuum (PSTMSV) states as a special case), in CV-MDI-QKD. To this end, we derive the Wigner characteristic function for the resource states, from which the covariance matrix and, finally, the secret key rate expressions are extracted. The optimization of the state parameters is undertaken to find the most suitable resource states in this family of states. There have been previous studies on the PSTMSV and PSTMSC states in CV-MDI-QKD that make use of the PS operation. We evaluate such proposals and find, to our surprise, that both PSTMSC and PSTMSV resource states underperform compared to the TMSV state, rendering the PS operation and displacement undesirable.