Robots generally have a structure that combines rotational joints and links in a serial fashion. On the other hand, various joint mechanisms are being utilized in practice, such as prismatic joints, closed links, and wire-driven systems. Previous research has focused on individual mechanisms, proposing methods to design robots capable of achieving given tasks by optimizing the length of links and the arrangement of the joints. In this study, we propose a method for the design optimization of robots that combine different types of joints, specifically rotational and prismatic joints. The objective is to automatically generate a robot that minimizes the number of joints and link lengths while accomplishing a desired task, by utilizing a black-box multi-objective optimization approach. This enables the simultaneous observation of a diverse range of body designs through the obtained Pareto solutions. Our findings confirm the emergence of practical and known combinations of rotational and prismatic joints, as well as the discovery of novel joint combinations.
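As a minimal sketch of what "Pareto solutions" means for the two objectives named above (number of joints and link length, both minimized), the following filter keeps only the non-dominated designs. The candidate values and the names `Design`, `dominates`, and `pareto_front` are illustrative assumptions, not taken from the study, and the study's black-box optimizer is not reproduced here.

```python
from typing import List, Tuple

# Hypothetical design summary: (number of joints, total link length), both minimized.
Design = Tuple[int, float]


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Illustrative candidates, assumed to all accomplish the task.
candidates = [(3, 1.2), (2, 1.5), (4, 1.1), (3, 1.6), (2, 1.9)]
front = pareto_front(candidates)
print(front)  # [(3, 1.2), (2, 1.5), (4, 1.1)]
```

Each surviving design trades one objective against the other, which is why the front exposes a range of body designs rather than a single optimum.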
Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of "decentralized factories", multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap -- the performance difference between these two training modalities. In our evaluation, joint training can outperform the best-performing combinations of dispatching rules and modular training. However, the coordinat
In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by a decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on an analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property of the decomposed low-rank tensor, our method achieves an effect equivalent to brute-force 3D convolution while incurring only a small computational overhead. To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and an edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization.
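The claim that filtering a decomposed tensor can match brute-force convolution rests on a general linearity property, which the sketch below checks in a reduced setting: a 2D rank-1 grid (an outer product of two vectors) and a separable Gaussian kernel. This is an assumed toy setup, not the paper's implementation; all function names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gaussian_kernel(sigma: float, radius: int) -> Vector:
    """Normalized, symmetric 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def conv1d_same(signal: Vector, kernel: Vector) -> Vector:
    """Zero-padded 1D filtering; the kernel is symmetric, so this equals convolution."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - radius
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out


def outer(u: Vector, v: Vector) -> Matrix:
    """Rank-1 grid whose entry (i, j) is u[i] * v[j]."""
    return [[a * b for b in v] for a in u]


def blur2d(grid: Matrix, kernel: Vector) -> Matrix:
    """Separable 2D blur on a dense grid: filter every row, then every column."""
    rows = [conv1d_same(row, kernel) for row in grid]
    cols = [conv1d_same(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


u = [0.0, 1.0, 4.0, 2.0, 0.5]
v = [3.0, 0.0, 1.0, 2.0]
kernel = gaussian_kernel(sigma=1.0, radius=2)

dense_blur = blur2d(outer(u, v), kernel)  # blur the full grid
factor_blur = outer(conv1d_same(u, kernel), conv1d_same(v, kernel))  # blur the factors only
max_gap = max(abs(a - b) for ra, rb in zip(dense_blur, factor_blur) for a, b in zip(ra, rb))
```

Here `max_gap` is at the level of floating-point rounding. Filtering the factors costs one 1D pass per factor instead of a pass over every grid cell, which is the kind of saving the abstract refers to.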
Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demons
6G communications systems are expected to integrate radar-like sensing capabilities enabling novel use cases. However, integrated sensing and communications (ISAC) introduces a trade-off between communications and sensing performance because the optimal constellations for each task differ. In this paper, we compare geometric, probabilistic and joint constellation shaping for orthogonal frequency division multiplexing (OFDM)-ISAC systems using an autoencoder (AE) framework. We first derive the constellation-dependent detection probability and propose a novel loss function to include the sensing performance in the AE framework. Our simulation results demonstrate that constellation shaping enables a dynamic trade-off between communications and sensing. Depending on whether sensing or communications performance is prioritized, geometric or probabilistic constellation shaping is preferred. Joint constellation shaping combines the advantages of geometric and probabilistic shaping, significantly outperforming legacy modulation formats.
Solder joint reliability related to failures due to thermomechanical loading is a critically important yet physically complex engineering problem. As a result, simulated behavior is oftentimes computationally expensive. In an increasingly data-driven world, the usage of efficient data-driven design schemes is a popular choice. Among them, Bayesian optimization (BO) with Gaussian process regression is one of the most important representatives. The authors argue that computational savings can be obtained from exploiting thorough surrogate modeling and selecting a design candidate based on multiple acquisition functions. This is feasible due to the relatively low computational cost, compared to the expensive simulation objective. This paper addresses the shortcomings in the adjacent literature by providing and implementing a novel heuristic framework to perform BO with adaptive hyperparameters across the various optimization iterations. Adaptive BO is subsequently compared to regular BO when faced with synthetic objective minimization problems. The results show the efficiency of adaptive BO when compared to any worst-performing regular Bayesian schemes. As an engineering use case, the so
Recovering high-resolution structural and compositional information from coherent X-ray measurements involves solving coupled, nonlinear, and ill-posed inverse problems. Ptychography reconstructs a complex transmission function from overlapping diffraction patterns, while X-ray fluorescence provides quantitative, element-specific contrast at lower spatial resolution. We formulate a joint variational framework that integrates these two modalities into a single nonlinear least-squares problem with shared spatial variables. This formulation enforces cross-modal consistency between structural and compositional estimates, improving conditioning and promoting stable convergence. The resulting optimization couples complementary contrast mechanisms (i.e., phase and absorption from ptychography, elemental composition from fluorescence) within a unified inverse model. Numerical experiments on simulated data demonstrate that the joint reconstruction achieves faster convergence, sharper and more quantitative reconstructions, and lower relative error compared with separate inversions. The proposed approach illustrates how multimodal variational formulations can enhance stability, resolution, an
Contemporary approaches to solving various problems that require analyzing three-dimensional (3D) meshes and point clouds have adopted deep learning algorithms that directly process 3D data such as point coordinates, normal vectors and vertex connectivity information. Our work proposes one such solution to the problem of positioning body and finger animation skeleton joints within 3D models of human bodies. Due to the scarcity of annotated real human scans, we resort to generating synthetic samples while varying their shape and pose parameters. Similarly to the state-of-the-art approach, our method computes each joint location as a convex combination of input points. Given only a list of point coordinates and normal vector estimates as input, a dynamic graph convolutional neural network is used to predict the coefficients of the convex combinations. By comparing our method with the state-of-the-art, we show that it is possible to achieve significantly better results with a simpler architecture, especially for finger joints. Since our solution requires fewer precomputed features, it also allows for shorter processing times.
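The convex-combination step can be sketched as follows. Using a softmax to turn per-point scores into non-negative coefficients that sum to one is an assumption made here for illustration; the abstract states only that a network predicts the coefficients. The names `softmax` and `joint_location` and the sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Map arbitrary scores to non-negative weights that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_location(points: Sequence[Point], logits: Sequence[float]) -> Point:
    """Joint position as a convex combination of the input points."""
    weights = softmax(logits)
    x, y, z = (sum(w * p[d] for w, p in zip(weights, points)) for d in range(3))
    return (x, y, z)


# Stand-in for a point cloud; in practice the logits would come from the network.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
uniform = joint_location(points, [0.0, 0.0, 0.0, 0.0])
peaked = joint_location(points, [0.0, 50.0, 0.0, 0.0])
```

Because the weights are non-negative and sum to one, the predicted joint always lies inside the convex hull of the input points, so it cannot drift far outside the body surface.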
Electric Vehicles (EVs) are becoming increasingly prevalent, with studies highlighting their potential as mobile energy storage systems to provide grid support. Realising this potential requires effective charging coordination, which is often formulated as a mixed-integer programming (MIP) problem. However, MIP problems are NP-hard and often intractable when applied to time-sensitive tasks. To address this limitation, we propose a deep learning assisted approach for optimising a day-ahead EV joint routing and scheduling problem with a varying number of EVs. This problem simultaneously optimises EV routing, charging, discharging and generator scheduling within a distribution network with renewable energy sources. A convolutional neural network is trained to predict the binary variables, thereby reducing the solution search space and enabling solvers to determine the remaining variables more efficiently. Additionally, a padding mechanism is included to handle the changes in input and output sizes caused by the varying number of EVs, thus eliminating the need for re-training. In a case study on the IEEE 33-bus system and Nguyen-Dupuis transportation network, our approach reduced ru
With the advent of Joint Embedding Predictive Architectures (JEPAs), which appear to be more capable than reconstruction-based methods, this paper introduces a novel technique for creating world models using continuous-time dynamic systems from arbitrary observation data. The proposed method integrates sequence embeddings with neural ordinary differential equations (neural ODEs). It employs loss functions that enforce contractive embeddings and Lipschitz constants in state transitions to construct a well-organized latent state space. The approach's effectiveness is demonstrated through the generation of structured latent state-space models for a simple pendulum system using only image data. This opens up a new technique for developing more general control algorithms and estimation techniques with broad applications in robotics.
Conformal prediction provides machine learning models with prediction sets that offer theoretical guarantees, but the underlying assumption of exchangeability limits its applicability to time series data. Furthermore, existing approaches struggle to handle multi-step ahead prediction tasks, where uncertainty estimates across multiple future time points are crucial. We propose JANET (Joint Adaptive predictioN-region Estimation for Time-series), a novel framework for constructing conformal prediction regions that are valid for both univariate and multivariate time series. JANET generalises the inductive conformal framework and efficiently produces joint prediction regions with controlled K-familywise error rates, enabling flexible adaptation to specific application needs. Our empirical evaluation demonstrates JANET's superior performance in multi-step prediction tasks across diverse time series datasets, highlighting its potential for reliable and interpretable uncertainty quantification in sequential data.
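To make the idea of a joint prediction region over several horizons concrete, here is a generic split-conformal sketch. It is not JANET's procedure, which the abstract does not specify. It assumes exchangeable calibration series and uses the k-th largest absolute residual across horizons as the nonconformity score, so that k or more horizons fall outside the band with probability at most alpha. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kth_largest_abs_residual(y: Sequence[float], y_hat: Sequence[float], k: int = 1) -> float:
    """Nonconformity score of one series: k-th largest |error| over all horizons."""
    residuals = sorted((abs(a - b) for a, b in zip(y, y_hat)), reverse=True)
    return residuals[k - 1]


def conformal_radius(cal_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration series for this alpha
    return sorted(cal_scores)[rank - 1]


def joint_region(forecast: Sequence[float], radius: float) -> List[Tuple[float, float]]:
    """Same-width band around the forecast at every horizon."""
    return [(f - radius, f + radius) for f in forecast]


# Seven illustrative calibration series with three horizons; forecasts are all zero.
cal_truth = [[0.5 * i, -1.0 * i, 0.25 * i] for i in range(1, 8)]
cal_pred = [[0.0, 0.0, 0.0] for _ in range(7)]

scores = [kth_largest_abs_residual(y, p, k=1) for y, p in zip(cal_truth, cal_pred)]
radius = conformal_radius(scores, alpha=0.25)
region = joint_region([10.0, 11.0, 12.0], radius)
```

With `k=1` the score is the maximum error over horizons, which gives the usual familywise guarantee; larger `k` tolerates a few violated horizons in exchange for a narrower band.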
We consider the irreducibility of the regular representation of a noncompact semisimple Lie group $G$ on the Hilbert space of the image of the Joint-Eigenspace Fourier transform on its corresponding symmetric space $G/K.$ The $L^{2}$-decomposition of the Joint-Eigenspace Fourier transform leads to the complete characterization of the said irreducibility in terms of the simplicity of a pair of members of $\mathfrak{a}^{*}_{\mathbb{C}}.$
Flux-ratio anomalies in quadruply imaged quasars are sensitive to the imprint of low-mass dark-matter haloes. The reliability of detection depends on the robustness of the smooth mass model. Optical surveys show that massive early-type galaxies similar to galaxy-scale gravitational lenses depart from perfect ellipticity, exhibiting $m=3$ and $m=4$ multipole distortions. We construct the semi-analytic, five-dimensional joint population prior for the $m=3$ and $m=4$ amplitude and orientation as well as the axis ratio of the deflector, calibrated on the sample of 840 SDSS E/S0 galaxies. The parameters are fitted via hierarchical Bayesian modeling, minimizing a joint Jensen-Shannon divergence between model and data. We use this prior to model the mass distribution of mock lenses with HST quality data with different multipole amplitudes. We find that we robustly measure the true multipole amplitudes and orientations. Compared to fits that use only the four point-image positions, adding the lensed host-galaxy arcs tightens the 68 % credible regions of multipole parameters by factors of 3-12 and reduces the predicted flux-ratio uncertainties by a mean factor of ~6. This analysis does not
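For readers unfamiliar with the Jensen-Shannon divergence named above, a minimal sketch for two discrete distributions (for example, normalized histograms of a model and of data) follows. How the authors combine divergences across the five parameters into a joint objective is not stated in the abstract and is not modeled here. The function names and the example histograms are assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Illustrative normalized histograms over the same four bins.
model_hist = [0.1, 0.4, 0.4, 0.1]
data_hist = [0.2, 0.3, 0.3, 0.2]

gap = js_divergence(model_hist, data_hist)
```

Unlike the plain KL divergence, this quantity is symmetric in its arguments and bounded above by ln 2, so it stays finite even when one histogram has empty bins.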
Traffic sign detection is an important research direction in intelligent driving. Unfortunately, existing methods often overlook extreme conditions such as fog, rain, and motion blur. Moreover, the end-to-end training strategy for image denoising and object detection models fails to utilize inter-model information effectively. To address these issues, we propose CCSPNet, an efficient feature extraction module based on Contextual Transformer and CNN, capable of effectively utilizing the static and dynamic features of images, achieving faster inference speed and providing stronger feature enhancement capabilities. Furthermore, we establish the correlation between object detection and image denoising tasks and propose a joint training model, CCSPNet-Joint, to improve data efficiency and generalization. Finally, to validate our approach, we create the CCTSDB-AUG dataset for traffic sign detection in extreme scenarios. Extensive experiments have shown that CCSPNet achieves state-of-the-art performance in traffic sign detection under extreme conditions. Compared to end-to-end methods, CCSPNet-Joint achieves a 5.32% improvement in precision and an 18.09% improvement in mAP@.5.
Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient information disentangling, so pitch and rhythm may still be mixed together. The resulting risk of information overlap in the disentangling process reduces speech naturalness. To overcome these limits, we propose a two-stage model to disentangle speech representations in a self-supervised manner without a human-crafted bottleneck design, which uses the Mutual Information (MI) with the designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage issues. Experiments show that our model achieves better performance than the baseline in terms of disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.
Cell-free massive multiple-input multiple-output (MIMO) is a promising cellular network architecture. In this network, a large number of distributed, multi-antenna access points (APs) jointly serve many single-antenna users using the same time-frequency resource. Consequently, it can potentially provide a uniform service experience to users regardless of their locations by eliminating interference at cell boundaries via user-centric joint transmission. This joint transmission, however, requires extremely high signaling overheads for data sharing via backhaul links and causes high network-wide power consumption. To resolve these problems, in this paper, we present a novel joint transmission method, which is referred to as sparse joint transmission (sparse-JT), for cell-free massive MIMO networks with finite backhaul capacity constraints. Sparse-JT jointly identifies the user-centric cooperative AP sets, precoding vectors for beamforming and compression, and power allocation that maximize a lower bound of the sum-spectral efficiency under the constraint that the total number of active APs per joint transmission is sparse. The proposed algorithm is guaranteed to identify a local-optimal solut
With the recent influx in demand for multi-robot systems throughout industry and academia, there is an increasing need for faster, robust, and generalizable path planning algorithms. Similarly, given the inherent connection between control algorithms and multi-robot path planners, there is in turn an increased demand for fast, efficient, and robust controllers. We propose a scalable joint path planning and control algorithm for multi-robot systems with constrained behaviours based on factor graph optimization. We demonstrate our algorithm on a series of hardware and simulated experiments. Our algorithm is consistently able to recover from disturbances and avoid obstacles while outperforming state-of-the-art methods in optimization time, path deviation, and inter-robot errors. See the code and supplementary video for experiments.
Text-conditioned motion synthesis has made remarkable progress with the emergence of diffusion models. However, the majority of these motion diffusion models are primarily designed for a single character and overlook multi-human interactions. In our approach, we explore this problem by synthesizing human motion with interactions for a group of characters of any size in a zero-shot manner. The key aspect of our approach is the adaptation of human-wise interactions as pairs of human joints that can be either in contact or separated by a desired distance. In contrast to existing methods that necessitate training motion generation models on multi-human motion datasets with a fixed number of characters, our approach inherently possesses the flexibility to model human interactions involving an arbitrary number of individuals, thereby transcending the limitations imposed by the training data. We introduce a novel controllable motion generation method, InterControl, to encourage the synthesized motions to maintain the desired distance between joint pairs. It consists of a motion controller and an inverse kinematics guidance module that realistically and accurately aligns the join
An image processing unit (IPU), or image signal processor (ISP), for high dynamic range (HDR) imaging usually consists of demosaicing, white balancing, lens shading correction, color correction, denoising, and tone-mapping. Besides noise from the imaging sensors, almost every step in the ISP introduces or amplifies noise in different ways, and denoising operators are designed to reduce the noise from these sources. Designed for dynamic range compression, tone-mapping operators in an ISP can significantly amplify the noise level, especially for images captured in low-light conditions, making denoising very difficult. Therefore, we propose a joint multi-scale denoising and tone-mapping framework that is designed with both operations in mind for HDR images. Our joint network is trained end-to-end, optimizing both operators together, to prevent the tone-mapping operator from overwhelming the denoising operator. Our model outperforms existing HDR denoising and tone-mapping operators both quantitatively and qualitatively on most of our benchmarking datasets.
This article introduces a bistatic joint radar-communication (RadCom) system based on orthogonal frequency-division multiplexing (OFDM). In this context, the adopted OFDM frame structure is described and a system model encompassing time, frequency, and sampling synchronization mismatches between the transmitter and receiver of the bistatic system is outlined. Next, the signal processing approaches for synchronization and communication are discussed, and radar sensing processing approaches using either only pilots or a reconstructed OFDM frame based on the estimated receive communication data are presented. Finally, proof-of-concept measurement results are presented to validate the investigated system, and a trade-off between frame size and the performance of the aforementioned processing steps is observed.