We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
Shooting video with handheld devices often results in blurry frames due to shaking hands and other sources of instability. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the deblurring ability of the model, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct synthetic and real-world handheld video datasets for handheld video deblurring. Extensive experiments on these and other common real-world datasets demonstrate that our method significantly outperforms existing self
The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and, at inference, restores artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations, and show that diagnostic accuracy increases to 81.17% on an unseen dataset and multiple artifact conditions
Handheld ultrasound devices face usage limitations due to user inexperience and cannot benefit from supervised deep learning without extensive expert annotations. Moreover, models trained on standard ultrasound device data are constrained by the training data distribution and perform poorly when directly applied to handheld device data. In this study, we propose the Training-free Image Style Alignment (TISA) framework to align the style of handheld device data to that of standard devices. The proposed TISA can run inference directly on handheld device images without extra training and is suited for clinical applications. We show that TISA performs better and more stably in medical detection and segmentation tasks for handheld device data. We further validate TISA as the clinical model for automatic measurements of spinal curvature and carotid intima-media thickness. The automatic measurements agree well with manual measurements made by human experts and the measurement errors remain within clinically acceptable ranges. We demonstrate the potential for TISA to facilitate automatic diagnosis on handheld ultrasound devices and expedite their eventual widespread use.
Recent advancements have showcased the potential of handheld millimeter-wave (mmWave) imaging, which applies synthetic aperture radar (SAR) principles in portable settings. However, existing studies addressing handheld motion errors either rely on costly tracking devices or employ simplified imaging models, leading to impractical deployment or limited performance. In this paper, we present IFNet, a novel deep unfolding network that combines the strengths of signal processing models and deep neural networks to achieve robust imaging and focusing for handheld mmWave systems. We first formulate the handheld imaging model by integrating multiple priors about mmWave images and handheld phase errors. Furthermore, we transform the optimization processes into an iterative network structure for improved and efficient imaging performance. Extensive experiments demonstrate that IFNet effectively compensates for handheld phase errors and recovers high-fidelity images from severely distorted signals. In comparison with existing methods, IFNet can achieve at least 11.89 dB improvement in average peak signal-to-noise ratio (PSNR) and 64.91% improvement in average structural similarity index measu
Patients with bladder dysfunction often lose the sensation of bladder fullness and cannot void naturally, forcing reliance on fixed-schedule catheterization that is uncomfortable and risks complications. We present WeeCare, a handheld conformable pad with fabric electrodes for on-demand bladder fullness sensing using electrical impedance tomography (EIT). The central challenge is that repeated removal and reattachment can introduce variation in electrode position and contact quality. We assess WeeCare along three axes: in-silico simulations characterizing electrode layout and noise robustness, in-vitro phantom experiments across urine salinities and filling levels, and an in-vivo human measurement for bladder fullness sensing, voiding, and filling dynamics.
The homopolar or unipolar generator, which is sometimes referred to as the Faraday paradox, is an experiment that shows an apparent contradiction between different predictions for induced emfs. I present a simple, handheld version of the experiment and a suggested resolution.
The design of image reconstruction algorithms for near-range handheld synthetic aperture radar (SAR) systems has gained increasing popularity due to the promising performance of portable millimeter-wave (MMW) imaging devices in various application fields. Time domain imaging algorithms including the backprojection algorithm (BPA) and the Kirchhoff migration algorithm (KMA) are widely adopted due to their direct applicability to arbitrary scan trajectories. However, they suffer from time complexity issues that hinder their practical application. Wavenumber domain algorithms greatly improve the computational efficiency but most of them are restricted to specific array topologies. Based on the factorization techniques as adopted in far-field synthetic aperture radar imaging, the time domain fast factorized backprojection algorithm for handheld synthetic aperture radar (HHFFBPA) is proposed. The local spectral properties of the radar images for handheld systems are analyzed and analytical spectrum compression techniques are derived to realize efficient sampling of the subimages. Validated through numerical simulations and experiments, HHFFBPA achieves fast and accurate 3-D imaging for
Handheld Augmented Reality (HAR) is revolutionizing the civil infrastructure application domain. The current trend in HAR relies on marker tracking technology. However, marker-based systems have several limitations, such as difficulty in use and installation, sensitivity to light, and marker design. In this paper, we propose a markerless HAR framework with GeoPose-based tracking. We use different gestures for manipulation and achieve 7 DOF (3 DOF each for translation and rotation, and 1 DOF for scaling). The proposed framework, called GHAR, is implemented for architectural building models. It augments virtual CAD models of buildings on the ground, enabling users to manipulate and visualize an architectural model before actual construction. The system offers a quick view of the building infrastructure, playing a vital role in requirement analysis and planning in construction technology. We evaluated the usability, manipulability, and comprehensibility of the proposed system using a standard user study with the System Usability Scale (SUS) and Handheld Augmented Reality User Study (HARUS). We compared our GeoPose-based markerless HAR framework with a marker-based HAR framework, findi
Accurately tracking food consumption is crucial for nutrition and health monitoring. Traditional approaches typically require specific camera angles, non-occluded images, or rely on gesture recognition to estimate intake, making assumptions about bite size rather than directly measuring food volume. We propose the FoodTrack framework for tracking and measuring the volume of hand-held food items using egocentric video; the framework is robust to hand occlusions and flexible with respect to varying camera and object poses. FoodTrack estimates food volume directly, without relying on intake gestures or fixed assumptions about bite size, offering a more accurate and adaptable solution for tracking food consumption. We achieve an absolute percentage loss of approximately 7.01% on a handheld food object, improving upon a previous approach that achieved a 16.40% mean absolute percentage error in its best case, under less flexible conditions.
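The percentage figures quoted in the FoodTrack abstract can be read against the standard definition of mean absolute percentage error (MAPE). The sketch below is only an illustration of that textbook formula; the function name and its arguments are hypothetical, and the abstract does not state how FoodTrack computes its own "absolute percentage loss".

```python
def mean_absolute_percentage_error(true_volumes, estimated_volumes):
    """Return the MAPE, in percent, between reference and estimated volumes.

    Assumption: the textbook definition, mean(|estimate - truth| / |truth|) * 100.
    The names used here are illustrative and do not come from the FoodTrack paper.
    """
    if len(true_volumes) != len(estimated_volumes):
        raise ValueError("both sequences must have the same length")
    if not true_volumes:
        raise ValueError("at least one measurement is required")

    total = 0.0
    for truth, estimate in zip(true_volumes, estimated_volumes):
        if truth == 0:
            # The relative error is undefined for a zero reference volume.
            raise ValueError("reference volumes must be non-zero")
        total += abs(estimate - truth) / abs(truth)

    return 100.0 * total / len(true_volumes)
```

With a single measurement the same formula reduces to the absolute percentage error of that one item, which is one plausible reading of a result reported on a single handheld food object.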
Rydberg atoms, due to their large polarizabilities and strong transition dipole moments, have been utilized as sensitive electric field sensors. While their capability to detect modulated signals has been previously demonstrated, these studies have largely been limited to laboratory-generated signals tailored specifically for atomic detection. Here, we extend the practical applicability of Rydberg sensors by demonstrating the reception of real-world frequency-modulated (FM) audio transmissions using a consumer-grade handheld two-way radio operating in the UHF band. Detection is based on the AC Stark shift induced by the radio signal in a Rydberg atomic vapor, with demodulation performed using an offset local oscillator and lock-in amplification. We successfully demodulate speech signals and evaluate the audio spectral response and reception range. We show that all consumer-accessible radio channels can be simultaneously detected, and demonstrate simultaneous reception of two neighboring channels with at least 53 dB of isolation. This work underscores the potential of Rydberg atom-based receivers for practical, real-world FM signal detection.
Movement directly reflects neurological and musculoskeletal health, yet objective biomechanical assessment is rarely available in routine care. We introduce Portable Biomechanics Laboratory (PBL), a secure platform for fitting biomechanical models to video collected with a handheld, moving smartphone. We validate this approach on over 15 hours of data synchronized to ground truth motion capture, finding mean joint-angle errors < 3° and pelvis-translation errors of a few centimeters across patients with neurological injury, lower-limb prosthesis users, pediatric in-patients, and controls. In > 5 hours of prospective deployments to neurosurgery and sports-medicine clinics, PBL was easy to set up, yielded highly reliable gait metrics (ICC > 0.9), and detected clinically relevant differences. For cervical-myelopathy patients, its measurement of gait quality correlated with modified Japanese Orthopedic Association (mJOA) scores and was responsive to clinical intervention. Handheld smartphone video can therefore deliver accurate, scalable, and low-burden biomechanical measurement, enabling greatly increased monitoring of movement impairments. We release the first clinically-v
Most digital music tools emphasize precision and control, but often lack support for tactile, improvisational workflows grounded in environmental interaction. Lumia addresses this by enabling users to "compose through looking"--transforming visual scenes into musical phrases using a handheld, camera-based interface and large multimodal models. A vision-language model (GPT-4V) analyzes captured imagery to generate structured prompts, which, combined with user-selected instrumentation, guide a text-to-music pipeline (Stable Audio). This real-time process allows users to frame, capture, and layer audio interactively, producing loopable musical segments through embodied interaction. The system supports a co-creative workflow where human intent and model inference shape the musical outcome. By embedding generative AI within a physical device, Lumia bridges perception and composition, introducing a new modality for creative exploration that merges vision, language, and sound. It repositions generative music not as a task of parameter tuning, but as an improvisational practice driven by contextual, sensory engagement.
We present an intuitive human-drone interaction system that utilizes a gesture-based motion controller to enhance the drone operation experience in real and simulated environments. The handheld motion controller enables natural control of the drone through the movements of the operator's hand, thumb, and index finger: the trigger press manages the throttle, the tilt of the hand adjusts pitch and roll, and the thumbstick controls yaw rotation. Communication with drones is facilitated via the ExpressLRS radio protocol, ensuring robust connectivity across various frequencies. The user evaluation of the flight experience with the designed drone controller using the UEQ-S survey showed high scores for both Pragmatic (mean=2.2, SD = 0.8) and Hedonic (mean=2.3, SD = 0.9) Qualities. This versatile control interface supports applications such as research, drone racing, and training programs in real and simulated environments, thereby contributing to advances in the field of human-drone interaction.
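The control mapping described in the abstract above (trigger to throttle, hand tilt to pitch and roll, thumbstick to yaw) can be sketched as a small pure function. This is a minimal illustration under stated assumptions: the class names, field names, value ranges, and the `max_tilt_deg` parameter are all hypothetical, and the sketch does not model the ExpressLRS link or the authors' actual firmware.

```python
from dataclasses import dataclass


@dataclass
class ControllerState:
    """Hypothetical raw readings from the handheld motion controller."""
    trigger: float          # index-finger trigger press, assumed in [0, 1]
    hand_pitch_deg: float   # forward/backward hand tilt in degrees
    hand_roll_deg: float    # left/right hand tilt in degrees
    thumbstick_x: float     # thumbstick horizontal axis, assumed in [-1, 1]


@dataclass
class DroneCommand:
    """Hypothetical normalized command for the flight controller."""
    throttle: float  # [0, 1]
    pitch: float     # [-1, 1]
    roll: float      # [-1, 1]
    yaw: float       # [-1, 1]


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def map_controller_to_command(state, max_tilt_deg=45.0):
    """Map controller readings to normalized drone commands.

    Assumption: hand tilt is scaled linearly so that max_tilt_deg
    corresponds to full stick deflection; larger tilts saturate.
    """
    return DroneCommand(
        throttle=clamp(state.trigger, 0.0, 1.0),
        pitch=clamp(state.hand_pitch_deg / max_tilt_deg, -1.0, 1.0),
        roll=clamp(state.hand_roll_deg / max_tilt_deg, -1.0, 1.0),
        yaw=clamp(state.thumbstick_x, -1.0, 1.0),
    )
```

Clamping every channel keeps the command within a fixed range even when the hand tilts past the assumed limit, which is a common precaution in such mappings rather than something the abstract specifies.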
Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy and a multi-reference attention mechanism to effectively integrate appearance information from both the front and back selfies into the video diffusion model. Further, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections for more realistic human-scene composition.
Smartphone cameras have become ubiquitous imaging tools, yet their small sensors and compact optics often limit spatial resolution and introduce distortions. Combining information from multiple low-resolution (LR) frames to produce a high-resolution (HR) image has been explored to overcome the inherent limitations of smartphone cameras. Despite the promise of multi-frame super-resolution (MFSR), current approaches are hindered by datasets that fail to capture the characteristic noise and motion patterns found in real-world handheld burst images. In this work, we address this gap by introducing a novel synthetic data engine that uses multi-exposure static images to synthesize LR-HR training pairs while preserving sensor-specific noise characteristics and image motion found during handheld burst photography. We also propose MFSR-GAN: a multi-scale RAW-to-RGB network for MFSR. Compared to prior approaches, MFSR-GAN emphasizes a "base frame" throughout its architecture to mitigate artifacts. Experimental results on both synthetic and real data demonstrate that MFSR-GAN trained with our synthetic engine yields sharper, more realistic reconstructions than existing methods for real-world
Surface visualizations are essential in analyzing three-dimensional spatiotemporal phenomena. Given its ability to provide enhanced spatial perception and scene maneuverability, virtual reality (VR) is an essential medium for surface visualization and interaction tasks. Such tasks primarily rely on visual cues that require an unoccluded view of the surface region under consideration. Haptic force feedback is a tangible interaction modality that alleviates the reliance on visual-only cues by allowing a direct physical sensation of the surface. In this paper, we evaluate the use of a force-based haptic stylus compared to handheld VR controllers via a between-subjects user study involving fundamental interaction tasks performed on surface visualizations. Keeping a consistent visual design across both modalities, our study incorporates tasks that require the localization of the highest, lowest, and random points on surfaces; and tasks that focus on brushing curves on surfaces with varying complexity and occlusion levels. Our findings show that participants took longer to brush curves using the haptic modality but could draw smoother curves compared to the handheld controllers. In contr
We present the description, results, and analysis of the experiments conducted to find the equivalent resolution associated with handheld devices, that is, the resolution beyond which users stop perceiving quality improvements when higher resolutions are presented on such devices. It is therefore the maximum resolution that is worth considering for generating and delivering video, as long as sequences are not too heavily compressed. Identifying the equivalent resolution thus allows for notable savings in bandwidth consumption. Subjective assessments have been carried out on fifty subjects using a set of video sequences of very different nature and four handheld devices with a broad range of screen dimensions. The results show that the equivalent resolution in current handheld devices is 720p, as higher resolutions are not valued by users.
Document capture applications on smartphones have emerged as popular tools for digitizing documents. For many individuals, capturing documents with their smartphones is more convenient than using dedicated photocopiers or scanners, even if the quality of digitization is lower. However, using a smartphone for digitization can become excessively time-consuming and tedious when a user needs to digitize a document with multiple pages. In this work, we propose a novel approach to automatically scan multi-page documents from a video stream as the user turns through the pages of the document. Unlike previous methods that required constrained settings such as mounting the phone on a tripod, our technique is designed to allow the user to hold the phone in their hand. Our technique is trained to be robust to the motion and instability inherent in handheld scanning. Our primary contributions in this work include: (1) an efficient, on-device deep learning model that is accurate and robust for handheld scanning, (2) a novel data collection and annotation technique for video document scanning, and (3) state-of-the-art results on the PUCIT page turn dataset.
Handheld kinesthetic haptic interfaces can provide greater mobility and richer tactile information as compared to traditional grounded devices. In this paper, we introduce a new handheld haptic interface which takes input using bidirectional coupled finger flexion. We present the device design motivation and design details and experimentally evaluate its performance in terms of transparency and rendering bandwidth using a handheld prototype device. In addition, we assess the device's functional performance through a user study comparing the proposed device to a commonly used grounded input device in a set of targeting and tracking tasks.